
[Comparative performance] str-functions vs. mem-functions


Alex Vinokur

Aug 27, 2003, 11:09:00 AM

Here are results of comparative performance tests carried out
using the same compiler (gcc 3.2)
in different environments (CYGWIN, MINGW, DJGPP)
on Windows 2000 Professional.


The following C-functions were tested:
* strcpy
* memcpy
* memmove
* memset

The summary results are below.
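
(A note on the metric: "user time used (via rusage)" below is the CPU
time the process spends in user mode. A minimal sketch of how such a
value can be read on systems that provide getrusage() follows; it is
only an illustration, not the Perfometer source.)

#include <sys/resource.h>   /* getrusage(), struct rusage */

/* Return user CPU time consumed so far, in milliseconds.
   ru_utime is a struct timeval, matching the "Resource State
   Unit" noted in the results below. */
static unsigned long long user_time_ms(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return (unsigned long long)ru.ru_utime.tv_sec * 1000
         + (unsigned long long)ru.ru_utime.tv_usec / 1000;
}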


#################################################################

str-functions vs. mem-functions (C-language)
============================================

C/C++ Performance Tests
=======================
Using C/C++ Program Perfometer
http://sourceforge.net/projects/cpp-perfometer
http://alexvn.freeservers.com/s1/perfometer.html

Environment
-----------
Windows 2000 Professional
* CYGWIN_NT-5.0 1.3.22(0.78/3/2)
* MINGW 2.0.0-2
* DJGPP 2.03
Intel(R) Celeron(R) CPU 1.70 GHz
GNU g++/gpp 3.2
Compilation : No optimization

================ Performance tests : BEGIN ================

#==========================================================
# Comparison : str-functions vs. mem-functions (C)
#----------------------------------------------------------
# Resource Name : user time used (via rusage)
# Resource Cost Unit : milliseconds (unsigned long long)
# Resource State Unit : timeval
#==========================================================

Summary test results
CYGWIN_NT-5.0 1.3.22(0.78/3/2)
gcc/g++ version 3.2 20020927 (prerelease)
============================
-------------------------------------------
|   |          |    User time used for    |
| N | Function |       string size        |
|   |          |--------------------------|
|   |          |     10 |    100 |   1000 |
|-----------------------------------------|
| 1 | strcpy   |     46 |    165 |   1088 |
| 2 | memcpy   |    166 |    223 |   1069 |
| 3 | memmove  |    215 |    274 |   1124 |
| 4 | memset   |    133 |    213 |    540 |
-------------------------------------------
Note C1. memcpy is slower than strcpy
for relatively short strings
Note C2. memmove is slower than memcpy
Raw Log : http://groups.google.com/groups?selm=bii0e2%249mi9c%241%40ID-79865.news.uni-berlin.de

Summary test results
MINGW 2.0.0-2
gcc/g++ version 3.2 (mingw special 20020817-1)
============================
-------------------------------------------
|   |          |    User time used for    |
| N | Function |       string size        |
|   |          |--------------------------|
|   |          |     10 |    100 |   1000 |
|-----------------------------------------|
| 1 | strcpy   |     50 |    250 |   1982 |
| 2 | memcpy   |     51 |    180 |   1150 |
| 3 | memmove  |     50 |    180 |   1161 |
| 4 | memset   |    110 |    160 |    607 |
-------------------------------------------
Note M1. Actually memcpy is faster than strcpy
Note M2. memmove & memcpy have the same performance
Raw Log : http://groups.google.com/groups?selm=bii0ee%249mi9c%242%40ID-79865.news.uni-berlin.de

Summary test results
DJGPP 2.03
gcc/gpp version 3.2.1
============================
-------------------------------------------
|   |          |    User time used for    |
| N | Function |       string size        |
|   |          |--------------------------|
|   |          |     10 |    100 |   1000 |
|-----------------------------------------|
| 1 | strcpy   |     63 |    567 |   5329 |
| 2 | memcpy   |    173 |    329 |   1492 |
| 3 | memmove  |    274 |    384 |   1098 |
| 4 | memset   |    191 |    292 |    695 |
-------------------------------------------
Note D1. For very short strings
memcpy is slower than strcpy;
for longer strings
memcpy is faster than strcpy.
Note D2. strcpy seems to run very slowly
(relative to CYGWIN and MINGW)
Raw Log : http://groups.google.com/groups?selm=bii0e2%249mi9c%241%40ID-79865.news.uni-berlin.de

================ Performance tests : END ==================


==============================================
Alex Vinokur
mailto:ale...@connect.to
http://mathforum.org/library/view/10978.html
==============================================

Jack Klein

Aug 27, 2003, 11:17:35 PM

On Wed, 27 Aug 2003 18:09:00 +0300, "Alex Vinokur"
<ale...@bigfoot.com> wrote in comp.lang.c:

>
> Here are results of comparative performance tests carried out
> using the same compiler (gcc 3.2)
> in different environments (CYGWIN, MINGW, DJGPP)
> on Windows 2000 Professional.

Please don't post stuff like this in comp.lang.c, it's completely
off-topic here.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++ ftp://snurse-l.org/pub/acllc-c++/faq

pete

Aug 28, 2003, 8:07:58 AM

Jack Klein wrote:
>
> On Wed, 27 Aug 2003 18:09:00 +0300, "Alex Vinokur"
> <ale...@bigfoot.com> wrote in comp.lang.c:
>
> >
> > Here are results of comparative performance tests carried out
> > using the same compiler (gcc 3.2)
> > in different environments (CYGWIN, MINGW, DJGPP)
> > on Windows 2000 Professional.
>
> Please don't post stuff like this in comp.lang.c, it's completely
> off-topic here.

I thought it was interesting, and I don't think there's
another newsgroup where speed comparisons of
various standard C library functions on various platforms
are on topic.
His post was language-specific: C.

People are always asking "what's faster" on this newsgroup.
I don't see anything wrong with a practical demonstration that
"Precise answers to these and many similar questions depend
of course on the processor and compiler in use. If you simply
must know, you'll have to time test programs carefully."

http://www.eskimo.com/~scs/C-faq/q20.14.html

--
pete

Alan Balmer

Aug 28, 2003, 12:25:08 PM

On Thu, 28 Aug 2003 12:07:58 GMT, pete <pfi...@mindspring.com> wrote:

>People are always asking "what's faster" on this newsgroup.

And people are always being told that such questions are both
off-topic and meaningless to standard C.

--
Al Balmer
Balmer Consulting
removebalmerc...@att.net

Mon

Aug 28, 2003, 12:42:29 PM

This is about measurement of software, not software for measurement.....
pete <pfi...@mindspring.com> wrote in message
news:3F4DF0...@mindspring.com...

Joona I Palaste

Aug 28, 2003, 1:14:06 PM

Mon <m...@indigo.net> scribbled the following
on comp.lang.c:

> This is about measurement of software, not software for measurement.....

Yes, and? Strictly speaking, software is off-topic on comp.lang.c.
That is, software whose source code (written in C) is not being
discussed.

--
/-- Joona Palaste (pal...@cc.helsinki.fi) ---------------------------\
| Kingpriest of "The Flying Lemon Tree" G++ FR FW+ M- #108 D+ ADA N+++|
| http://www.helsinki.fi/~palaste W++ B OP+ |
\----------------------------------------- Finland rules! ------------/
"I am not very happy acting pleased whenever prominent scientists overmagnify
intellectual enlightenment."
- Anon

Randy Howard

Aug 28, 2003, 2:20:57 PM

In article <biihim$9opf4$1...@ID-79865.news.uni-berlin.de>,
ale...@bigfoot.com says...

> The following C-functions were tested :
> * strcpy
> * memcpy
> * memmove
> * memset
> # Resource Name : user time used (via rusage)

Note that "user time" may not be an appropriate way to measure such
things. "Wall clock time" (on an unloaded system of course) is a
far more appropriate method. Not to mention that rusage info is
not portable.

I also noticed that you posted this to c.l.c, yet the source in
the measurement package you referenced is C++. This effectively
renders the results of little or no value for c.l.c readers
interested in Standard C library performance using the C language
on compilers being used as C compilers, not C++ compilers.

> |   |          |    User time used for    |
> | N | Function |       string size        |
> |   |          |--------------------------|
> |   |          |     10 |    100 |   1000 |

If this is meant to be useful, the values used should include
larger sizes. Modern systems move strings (sometimes) and memory
(often) in much larger blocks. Additionally, you don't state whether
or not these values are from a single run or multiple runs.

If from multiple runs, are they the value from the best pass, worst
pass, average of the passes, etc.? Also, if multiple runs, are the
source and destination addresses the same for each iteration? If so,
cache warming could be artificially skewing the results as well.

I didn't dig into the C++ code to see the details as they pertain to
the above.
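
For illustration only, here is one way a harness might avoid reusing
the same addresses on every pass (the pool size and stride below are
made up for the example):

#include <string.h>

#define POOL_SIZE (8u * 1024u * 1024u)  /* much larger than the caches */

static char pool[POOL_SIZE];

/* Copy 'len' bytes once per pass, sliding source and destination
   through the pool so successive passes touch cold cache lines.
   Assumes len is much smaller than POOL_SIZE / 2. */
static void cold_copy(unsigned pass, size_t len)
{
    size_t off = ((size_t)pass * (len + 64)) % (POOL_SIZE / 2 - len);
    memcpy(pool + POOL_SIZE / 2 + off, pool + off, len);
}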

--
Randy Howard _o
2reply remove FOOBAR \<,
______________________()/ ()______________________________________________
SCO Spam-magnet: postm...@sco.com

Mon

Aug 28, 2003, 6:24:30 PM

Yes, and you are double-posting to comp.software.measurement...........
Joona I Palaste <pal...@cc.helsinki.fi> wrote in message
news:bild8u$pcm$1...@oravannahka.helsinki.fi...

Joona I Palaste

Aug 29, 2003, 1:34:30 AM

Mon <m...@indigo.net> scribbled the following
on comp.lang.c:
> Yes, and you are double-posting to comp.software.measurement...........

Well then, you're crossposting to comp.lang.c and gnu.gcc.help. But
you're right, sorry about not noticing the crosspost.

--
/-- Joona Palaste (pal...@cc.helsinki.fi) ---------------------------\
| Kingpriest of "The Flying Lemon Tree" G++ FR FW+ M- #108 D+ ADA N+++|
| http://www.helsinki.fi/~palaste W++ B OP+ |
\----------------------------------------- Finland rules! ------------/

"Stronger, no. More seductive, cunning, crunchier the Dark Side is."
- Mika P. Nieminen

Alex Vinokur

Aug 30, 2003, 5:16:48 AM

"Randy Howard" <randy....@FOOmegapathdslBAR.net> wrote in message news:MPG.19b80400f...@news.megapathdsl.net...

> In article <biihim$9opf4$1...@ID-79865.news.uni-berlin.de>,
> ale...@bigfoot.com says...
> > The following C-functions were tested :
> > * strcpy
> > * memcpy
> > * memmove
> > * memset
> > # Resource Name : user time used (via rusage)
>
> Note that "user time" may not be an appropriate way to measure such
> things. "Wall clock time" (on an unloaded system of course) is a
> far more appropriate method.

http://www.hyperdictionary.com/computing/wall+clock+time
http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?wall+clock+time

<QUOTE>
wall clock time
The elapsed time between when a process starts to run and when it is finished. This is usually longer than the processor time
consumed by the process because the CPU is doing other things besides running the process such as running other user and operating
system processes or waiting for disk or network I/O.
</QUOTE>

What should one do to measure "wall clock time"? Is it only a matter of running on an unloaded system?
What is the difference between measuring "user time used" and "wall clock time"?
"User time used" can be measured using clock(), uclock(), getrusage(), etc. Does the same apply to "wall clock time"?

> Not to mention that rusage info is not portable.

That is right. However:
1. clock() has low resolution (by the way, CLOCKS_PER_SEC is a _machine-dependent_ macro).
2. To measure performance, anyone can use appropriate (system-specific or not) get-time functions.
Otherwise, what do "system-specific get-time functions" exist for?
3. C/C++ Program Perfometer
(http://sourceforge.net/projects/cpp-perfometer/, http://alexvn.freeservers.com/s1/perfometer.html),
which was used to measure the performance of the str-functions and mem-functions,
makes it possible to measure the performance of a C/C++ program, and of separate pieces of code, for _any_ metric
(for instance: clocks, uclocks, rusage metrics, user-defined metrics, etc.).

>
> I also noticed that you posted this to c.l.c, yet the source in
> the measurement package you referenced is C++. This effectively
> renders the results of little or no value for c.l.c readers
> interesting in Standard C library performance using the C language
> on compilers being used as C compilers, not C++ compilers.

Right.
C/C++ Program Perfometer is written in C++.
So, the source unit containing the measured C functions was compiled with a C++ compiler.
What should be done is to compile that source unit with a C compiler.

>
> > |   |          |    User time used for    |
> > | N | Function |       string size        |
> > |   |          |--------------------------|
> > |   |          |     10 |    100 |   1000 |
>
> If this is meant to be useful, the values used should include
> larger sizes. Modern systems move strings (sometimes) and memory
> (often) in much larger blocks.

#################################################################

str-functions vs. mem-functions (C-language)
============================================

C/C++ Performance Tests
=======================
Using C/C++ Program Perfometer
http://sourceforge.net/projects/cpp-perfometer
http://alexvn.freeservers.com/s1/perfometer.html

Environment
-----------
Windows 2000 Professional
* CYGWIN_NT-5.0 1.3.22(0.78/3/2)
* MINGW 2.0.0-2
* DJGPP 2.03
Intel(R) Celeron(R) CPU 1.70 GHz
GNU g++/gpp 3.2
Compilation : No optimization

================ Performance tests : BEGIN ================

#==========================================================
# Comparison : str-functions vs. mem-functions (C)
#----------------------------------------------------------

# Resource Name : user time used (via rusage)
# Resource Cost Unit : milliseconds (unsigned long long)
#   per 30000 calls (repetitions)
# Resource State Unit : timeval
#==========================================================

Summary test results
CYGWIN_NT-5.0 1.3.22(0.78/3/2)
gcc/g++ version 3.2 20020927 (prerelease)
============================

-------------------------------------------
|   |          |    User time used for    |
| N | Function |       string size        |
|   |          |--------------------------|
|   |          |   1000 |  10000 | 100000 |
|-----------------------------------------|
| 1 | strcpy   |     43 |    467 |  21233 |
| 2 | memcpy   |     40 |    380 |  18576 |
| 3 | memmove  |     40 |    381 |  18590 |
| 4 | memset   |     20 |    117 |   7945 |
-------------------------------------------

Summary test results
MINGW 2.0.0-2
gcc/g++ version 3.2 (mingw special 20020817-1)
============================

-------------------------------------------
|   |          |    User time used for    |
| N | Function |       string size        |
|   |          |--------------------------|
|   |          |   1000 |  10000 | 100000 |
|-----------------------------------------|
| 1 | strcpy   |     50 |    477 |  15389 |
| 2 | memcpy   |     33 |    247 |  16443 |
| 3 | memmove  |     26 |    253 |  16454 |
| 4 | memset   |     10 |     80 |   6726 |
-------------------------------------------

Summary test results
DJGPP 2.03
gcc/gpp version 3.2.1
============================

-------------------------------------------
|   |          |    User time used for    |
| N | Function |       string size        |
|   |          |--------------------------|
|   |          |   1000 |  10000 | 100000 |
|-----------------------------------------|
| 1 | strcpy   |    164 |   1648 |  22746 |
| 2 | memcpy   |  small |    329 |  17105 |
| 3 | memmove  |     54 |    274 |  20896 |
| 4 | memset   |  small |    109 |   8223 |
-------------------------------------------

================ Performance tests : END ==================

#################################################################

> Additionally, you don't state whether
> or not these values are from a single run or multiple runs.
>
> If from multiple runs, are they the value from the best pass, worst
> pass, average of the passes, etc.? Also, if multiple runs, are the
> source and destination addresses the same for each iteration? If so,
> cache warming could be artificially skewing the results as well.

[snip]

Very schematically, measurements are carried out as follows:

time_type measured_time[NO_OF_TESTS];

for (i = 0; i < NO_OF_TESTS; i++)
{
  start_time = ...;   /* read the current resource value */
  for (k = 0; k < NO_OF_REPETITIONS; k++)
  {
    /* code to be measured */
  }
  end_time = ...;     /* read the resource value again */
  measured_time[i] = end_time - start_time;
}

sort (measured_time);

/* Discard the THRESHOLD cheapest and the THRESHOLD most
   expensive passes, then average the remaining ones. */
sum = 0;
for (i = THRESHOLD; i < (NO_OF_TESTS - THRESHOLD); i++)
{
  sum += measured_time[i];
}

elapsed_time = sum / (NO_OF_TESTS - 2*THRESHOLD);

Randy Howard

Aug 30, 2003, 9:58:30 PM

In article <bipq2a$bugo5$1...@ID-79865.news.uni-berlin.de>,
ale...@bigfoot.com says...

> > Note that "user time" may not be an appropriate way to measure such
> > things. "Wall clock time" (on an unloaded system of course) is a
> > far more appropriate method.
>
> What should one do to measure "wall clock time"? Is it only a matter of running on an unloaded system?

My perception is probably a bit colored because most of the performance
measurement work I've been involved in for the last few years has had to
do with measuring throughput on a system, rather than system call timings
themselves. For throughput measurement, you typically want to know, for
example, that a given gigabit ethernet NIC can send/receive a certain
amount of data (usually measured in mbits/sec) and sustain that level
for some amount of time. It might also be the switch whose crossbar
bandwidth you want to measure, without performance measurement
capabilities in the switch itself. In this case, you'd need to
accumulate the results across all ports connected to the switch.

Another example, measuring memory bandwidth, might look at the differences
in memcpy() performance on large-block memory moves across a 400 MHz vs.
533 MHz FSB implementation (Intel) or comparing that against the AMD
hypertransport implementation used for an Opteron system. In such
cases, rather than wanting to know the overhead of making 10,000 calls
to memcpy with 100 bytes of data, you're more likely to want to do
something like see how many megabytes of data you can actually move per
second. In the case of the Opteron, you might also find that a NUMA
kernel implementation (if available) might afford significant increases
in measured throughput.
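
As a rough sketch of what I mean (gettimeofday() stands in for the
wall clock here; the block size and pass count are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BLOCK  (16u * 1024u * 1024u)   /* 16 MB per copy */
#define PASSES 256

int main(void)
{
    char *src = malloc(BLOCK), *dst = malloc(BLOCK);
    struct timeval t0, t1;
    double secs;
    int i;

    if (!src || !dst)
        return 1;
    memset(src, 'x', BLOCK);           /* touch the pages first */

    gettimeofday(&t0, 0);
    for (i = 0; i < PASSES; i++)
        memcpy(dst, src, BLOCK);
    gettimeofday(&t1, 0);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%.1f MB/s\n", (double)BLOCK * PASSES / (1024.0 * 1024.0) / secs);

    free(src);
    free(dst);
    return 0;
}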

So, "wall clock" time, I.e. how much data can the implementation handle in
a physical time period is the thing you're targeting.

In the case of an SMP system, you might use multiple threads to try and
get as much aggregate throughput across the memory bus as possible per
second. As such, measuring per process (or per thread) code execution
time is not the same as timing how fast data can be moved overall. I've
been working on a lot of this type of stuff lately, which probably makes
me look at it differently than you do.

> 1. clock() has low resolution (by the way, CLOCKS_PER_SEC is a _machine-dependent_ macro).

clock() has varying degrees of resolution depending on the system.
System-dependent values are not the same as proprietary or
non-portable interfaces, btw.

Alternatively, you can use the regular time, even if measured in whole
seconds, if you are willing to run enough iterations that the error from
timer granularity is not a factor. A 30-minute test run where you
accumulate the number of calls completed, then divide by seconds, is
likely to be good enough. Any test that completes in a very short time
period is likely to be inaccurate anyway, or at least is unlikely to
reflect real-world results on a modern OS platform with a lot of other
background processes firing off at random intervals.
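
Something like this sketch, using nothing fancier than time() (the
run length and batch size are arbitrary):

#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    char dst[128];
    const char *src = "a test string of modest length.";
    double calls = 0;                  /* double: avoids overflow on
                                          long runs                  */
    time_t start = time(0);
    time_t stop  = start + 30 * 60;    /* run for 30 minutes */

    while (time(0) < stop)             /* check the clock per batch, */
    {                                  /* not per call               */
        unsigned i;
        for (i = 0; i < 10000; i++)
            strcpy(dst, src);
        calls += 10000;
    }

    printf("%.0f calls/sec\n", calls / difftime(time(0), start));
    return 0;
}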

> 2. To measure performance, anyone can use appropriate (system-specific or not) get-time functions.
> Otherwise, what do "system-specific get-time functions" exist for?

Because not all platforms have a high-resolution timer available, making
it hard to standardize. If you only intend to run the test(s) on a
finite number of platforms, then you can use the best possible method
for each, as long as when you're done you are sure that you are making
apples to apples comparisons when presenting results from the different
platforms.
