I've tried RtlCopyMemory and it seems extremely slow. I'm periodically
copying 4096 ULONGs from an output buffer associated with a pended IRP
to an output buffer associated with a synchronous request, but
performance is dismal. It seems as if each DWORD is being copied one by
one; the data rate is effectively the same as if I picked up every
DWORD individually from the hardware device. Is the WdfMemoryCopy...
combination any better, or is it just a wrapper around RtlCopyMemory?
If they are effectively the same, would it be better to have the
service owning the source buffer (an indefinitely pended IRP) declare a
section object? Can I avoid copying altogether with that solution? Will
a section object work at all if the source buffer is pended (and
therefore locked in memory), i.e. will it be able to use (or does it
need) the system paging file as backing store?
Thanks in advance for any tips.
Charles
From ordinary memory to ordinary memory, or is on-device memory involved?
--
Maxim S. Shatskih
Windows DDK MVP
ma...@storagecraft.com
http://www.storagecraft.com
--
This posting is provided "AS IS" with no warranties, and confers no rights.
"Charles Gardiner" <inv...@invalid.invalid> wrote in message
news:76qcsqF...@mid.individual.net...
> Thanks in advance for any tips you might have.
Maybe.
> but I don't think so. It really seems as if the RtlCopyMemory
> is moving DWORD for DWORD instead of doing a block copy (at least the
No, RtlCopyMemory is memcpy.
You need some serious profiler to find the cause of the slowdown.
Unless you are compiling your driver for 64-bit Itanium (with an old
DDK) or 64-bit Alpha AXP (never released), RtlCopyMemory() is an alias
for memcpy(), which is inlined by the C compiler to code that operates
one ULONG_PTR at a time (a rep movs instruction and some alignment code
on i386 and x64).
In contrast, RtlMoveMemory() is an alias for memmove() which the
compiler does not inline, and which wastes a few cycles on a call
instruction and if() statements to test for memory range overlap.
But after that initial overhead, RtlMoveMemory() happens to use
a more intelligent (and larger) copying loop which may make it faster
than the inlined rep movs instruction produced by RtlCopyMemory().
Bizarre but true...
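For reference, the WDK headers define the two essentially like this
(paraphrased from wdm.h; check the headers in your own kit):

/* Paraphrased from the WDK headers -- not something to paste into a
   driver, just what the two macros expand to: */
#define RtlCopyMemory(Destination, Source, Length) \
    memcpy((Destination), (Source), (Length))
#define RtlMoveMemory(Destination, Source, Length) \
    memmove((Destination), (Source), (Length))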
--
Jakob Bøhm, M.Sc.Eng. * j...@danware.dk * direct tel:+45-45-90-25-33
Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
http://www.netop.com * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
Information in this mail is hasty, not binding and may not be right.
Information in this posting may not be the official position of Netop
Solutions A/S, only the personal opinions of the author.
> In contrast, RtlMoveMemory() is an alias for memmove() which the
> compiler does not inline, and which wastes a few cycles on a call
> instruction and if() statements to test for memory range overlap.
> But after that initial overhead, RtlMoveMemory() happens to use
> a more intelligent (and larger) copying loop which may make it faster
> than the inlined rep movs instruction produced by RtlCopyMemory().
>
> Bizarre but true...
>
>
Just for info, the WDK docs compare these two functions in exactly
the opposite way. Something to the effect of "RtlCopyMemory runs
faster, but the two regions may not overlap".
Can you create a fake driver that performs all the tasks *but* the
memcpy, and see what the performance is in that case?
Have a nice day
GV
"Charles Gardiner" <inv...@invalid.invalid> wrote in message
news:76r6ddF...@mid.individual.net...
It's only when I run the demo app that I get the bottleneck. The app is
very simple CodeGear stuff, just picking up 4096 ULONGs every 20 ms or
so and writing the hex values to a listbox.
I've since added a second thread for the reads from the circular buffer
(the RtlCopyMemory initiator). The app runs a bit more smoothly, but
I'm still getting a system load of about 60% (Pentium
something-or-other with ICH7 chipset), which I find quite high.
Gianluca Varenni wrote:
Uhm.... I think the culprit is the listbox. Updating a listbox 50
times a second with 4096 values seems quite bad to me...
I bet that if you keep your code that picks up the ULONGs but disable
the update of the listbox, your application will run smoothly.
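Something like this usually helps (plain Win32, an untested sketch;
the CodeGear VCL wrapper will look different, and hList/values are
placeholders for the app's own handle and data):

#include <windows.h>

/* Suppress repainting while the 4096 items are replaced, then force
   one repaint at the end instead of 4096 of them. */
void UpdateListBox(HWND hList, const ULONG *values, int count)
{
    int i;
    char text[16];
    SendMessageA(hList, WM_SETREDRAW, FALSE, 0);
    SendMessageA(hList, LB_RESETCONTENT, 0, 0);
    for (i = 0; i < count; i++) {
        wsprintfA(text, "%08lX", values[i]);
        SendMessageA(hList, LB_ADDSTRING, 0, (LPARAM)text);
    }
    SendMessageA(hList, WM_SETREDRAW, TRUE, 0);
    InvalidateRect(hList, NULL, TRUE);
}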
GV
Well this would be true if the compiler did not inline memcpy() or if
the compiler inlined both memcpy() and memmove(). In practice, the
compiler inlines memcpy() but not memmove() and the inline versions of
both functions are optimized for small copy sizes while the out-of-line
versions are optimized for medium copy sizes. A version optimized for
really large sizes (ouch) would include additional SSE/MMX instructions
to do prefetching and other CPU cache management and would be specific
to each CPU generation and brand.
#include <xmmintrin.h>

/* Copy 'size' 16-byte blocks; both pointers must be 16-byte aligned. */
void moveto(__m128 *to, __m128 *from, int size)
{
    while (size-- > 0)
    {
        *to++ = *from++;
    }
}
There's some additional logic to handle moves that don't align to 128
bits; I also unroll the loop in an attempt to improve speed, and I
have a try/except block to catch errors. This works both on 32-bit
and on 64-bit. One advantage is that you have full control over
the amount of unrolling, and if you feel adventurous you can go the
extra mile and try one of the more esoteric data movement techniques
that Intel suggested in their tech papers when the xmm instruction set
was first announced.
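(For illustration only, an untested sketch of one such technique:
SSE2 non-temporal stores, which keep the destination out of the CPU
cache. Assumes 16-byte-aligned pointers and a whole number of 16-byte
blocks.)

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Copy size16 16-byte blocks using streaming stores; the destination
   bypasses the cache, which can win for buffers larger than the cache. */
void moveto_nt(__m128i *to, const __m128i *from, int size16)
{
    while (size16-- > 0)
        _mm_stream_si128(to++, _mm_load_si128(from++));
    _mm_sfence();   /* make the streamed stores globally visible */
}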
I checked the code that the compiler generated, and it looks pretty
decent. I don't know if this saves any significant amount of execution
time because my environment is so chip-bound that optimizing processor
performance doesn't improve the throughput one iota. And, of course,
depending on what you're trying to do, you may want to save/restore
some floating point state.
Alberto.
On May 12, 6:42 am, Jakob Bohm <j...@danware.dk> wrote:
> Charles Gardiner wrote:
> > Jakob Bohm schrieb:
>
> >> In contrast, RtlMoveMemory() is an alias for memmove() which the
> >> compiler does not inline, and which wastes a few cycles on a call
> >> instruction and if() statements to test for memory range overlap.
> >> But after that initial overhead, RtlMoveMemory() happens to use
> >> a more intelligent (and larger) copying loop which may make it faster
> >> than the inlined rep movs instruction produced by RtlCopyMemory().
>
> >> Bizarre but true...
>
> > Just for info, the WDK docs compare these two instructions in exactly
> > the opposite way. Something to the effect of "RtlCopyMemory runs faster
> > but the two regions may not overlap"
>
> Well this would be true if the compiler did not inline memcpy() or if
> the compiler inlined both memcpy() and memmove(). In practice, the
> compiler inlines memcpy() but not memmove() and the inline versions of
> both functions are optimized for small copy sizes while the out-of-line
> versions are optimized for medium copy sizes. A version optimized for
> really large sizes (ouch) would include additional SSE/MMX instructions
> to do prefetching and other CPU cache management and would be specific
> to each CPU generation and brand.
>
> --
> Jakob Bøhm, M.Sc.Eng. * j...@danware.dk * direct tel:+45-45-90-25-33
> Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
> http://www.netop.com * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
> Information in this mail is hasty, not binding and may not be right.
> Information in this posting may not be the official position of Netop
> Solutions A/S, only the personal opinions of the author.
Actually, you should be very careful about touching the floating point
state in kernel mode, although the documentation for what you may and
may not do is (or used to be) sketchy.
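On x86 the documented pattern is roughly this (a sketch only;
DoSseCopy() is a hypothetical placeholder, and the IRQL restrictions
in the KeSaveFloatingPointState documentation apply):

#include <ntddk.h>

extern void DoSseCopy(void *dst, const void *src, SIZE_T len); /* hypothetical */

/* Save the FPU/SSE state before touching it in kernel mode and
   restore it afterwards (x86; IRQL <= DISPATCH_LEVEL). */
NTSTATUS SafeSseCopy(void *dst, const void *src, SIZE_T len)
{
    KFLOAT_SAVE save;
    NTSTATUS status = KeSaveFloatingPointState(&save);
    if (!NT_SUCCESS(status))
        return status;
    DoSseCopy(dst, src, len);
    KeRestoreFloatingPointState(&save);
    return STATUS_SUCCESS;
}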
Also, such optimized memcpy code needs to be protected by if
statements or function pointers to use completely different code for
different CPU versions and brands. The optimal strategy for a Core2 is
probably not the same as for a Hammer, a Xeon or a Crusoe, just to name
a few. Which is why such code should really be provided by a specialist
vendor who can afford the time and hardware to design, tune and test on
every x86 and x64 CPU generation ever sold.
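The function-pointer variant might look like this (a sketch; every
Copy* routine and DetectCpuVariant() are hypothetical stand-ins for
the per-microarchitecture tuned versions):

#include <stddef.h>

typedef void (*COPY_FN)(void *dst, const void *src, size_t len);

extern void CopyRepMovs(void *dst, const void *src, size_t len); /* generic fallback */
extern void CopyCore2(void *dst, const void *src, size_t len);   /* Core2-tuned */
extern void CopyHammer(void *dst, const void *src, size_t len);  /* K8-tuned */
extern int  DetectCpuVariant(void);  /* hypothetical CPUID-based probe */

static COPY_FN g_copy = CopyRepMovs;

/* Choose the copy routine once at startup; callers then go through
   g_copy with no per-call branching. */
void InitCopyDispatch(void)
{
    switch (DetectCpuVariant()) {
    case 1:  g_copy = CopyCore2;  break;
    case 2:  g_copy = CopyHammer; break;
    default: g_copy = CopyRepMovs; break;  /* unknown CPU: generic code */
    }
}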
Ideally, that vendor would be the Microsoft team that writes the C
runtime library, but this would tend to be 1 or 2 CPU generations behind
the times due to new CPU designs being released more often than Windows
versions. It could also be the CPU makers themselves by exporting these
functions from the processor driver.
--
Jakob Bøhm, M.Sc.Eng. * j...@danware.dk * direct tel:+45-45-90-25-33
Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
http://www.netop.com * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
Yet here I'm moving 128 bits at a time, relying on a data type
provided to me by the DDK Compiler. I hope that the compiler handles
the difference between processor platforms, and that it generates
sensible code; yet I check every generated machine instruction. And
why did the compiler writers supply the facility if we're not supposed
to use it?
When a processor is issued, Intel or AMD usually supply plenty of
technical notes, design sheets, and other hardware-level
documentation. Those are valuable sources of enlightenment. But beyond
that, more often than not I'd rather roll my own code. If nothing
else, it allows me to go above and beyond what an API can give me.
So far, knock on wood, that specific memory move code hasn't caused
any grief.
Alberto.
On May 12, 11:05 am, Jakob Bohm <j...@danware.dk> wrote:
> Actually, you should be very careful about touching the floating point
> state in kernel mode, although the documentation for what you may and
> may not do is (or used to be) sketchy.
>
> Also, such optimized memcpy code needs to be protected by if
> statements or function pointers to use completely different code for
> different CPU versions and brands. The optimal strategy for a Core2 is
> probably not the same as for a Hammer, a Xeon or a Crusoe, just to name
> a few. Which is why such code should really be provided by a specialist
> vendor who can afford the time and hardware to design, tune and test on
> every x86 and x64 CPU generation ever sold.
>
> Ideally, that vendor would be the Microsoft team that writes the C
> runtime library, but this would tend to be 1 or 2 CPU generations behind
> the times due to new CPU designs being released more often than Windows
> versions. It could also be the CPU makers themselves by exporting these
> functions from the processor driver.
>
> --
> Jakob Bøhm, M.Sc.Eng. * j...@danware.dk * direct tel:+45-45-90-25-33
> Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
> http://www.netop.com * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
> Information in this mail is hasty, not binding and may not be right.
> Information in this posting may not be the official position of Netop
Also please note that as far as I recall, OpenGL ICDs are treated
specially with respect to floating point because this need is so common
in them.
> Yet here I'm moving 128 bits at a time, relying on a data type
> provided to me by the DDK Compiler. I hope that the compiler handles
> the difference between processor platforms, and that it generates
> sensible code; yet I check every generated machine instruction. And
> why did the compiler writers supply the facility if we're not supposed
> to use it ?
>
The DDK Compiler is the very same as the user mode compiler used for
compiling the user mode part of that NT release. Not every facility
found in that compiler (it also supports C++ exceptions for instance) is
designed to run in kernel mode, and not every advanced operation or
intrinsic is backward compatible with every supported CPU. For
instance, the SSE2 intrinsics are supposed to be used together with
conditional code that checks whether there is SSE2 on the end user's
CPU (and then uses another solution if there is not).
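In a driver that check can be done with ExIsProcessorFeaturePresent();
a sketch, with Sse2Copy() as a hypothetical stand-in for the SSE2
path:

#include <ntddk.h>

extern void Sse2Copy(void *dst, const void *src, SIZE_T len); /* hypothetical */

/* Take the SSE2 path only when the kernel reports SSE2 support,
   otherwise fall back to plain RtlCopyMemory. */
void CopyBuffer(void *dst, const void *src, SIZE_T len)
{
    if (ExIsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE))
        Sse2Copy(dst, src, len);
    else
        RtlCopyMemory(dst, src, len);
}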
> When a processor is issued, Intel or Amd usually supply plenty of
> technical notes, design sheets, and other hardware level
> documentation. Those are valuable sources of enlightment. But beyond
> that, more often than not I'd rather roll my own code. If nothing
> else, it allows me to go above and beyond what an API can give me.
Me too, but it can get tedious to maintain a collection of 5 or 10
memcpy() implementations for different CPU microarchitectures.
While prototype implementations of such functions can usually be cribbed
from the pages of those technical notes, the code in those notes tends
to be conceptual and not tested to the level expected of a frequently
used CRT function.
>
> So far, knock on wood, that specific memory move code didn't cause any
> grief.
>
>
> Alberto.
>
>
>
> On May 12, 11:05 am, Jakob Bohm <j...@danware.dk> wrote:
>
>> Actually, you should be very careful about touching the floating point
>> state in kernel mode, although the documentation for what you may and
>> may not do is (or used to be) sketchy.
>>
>> Also, such optimized memcpy code needs to be protected by if
>> statements or function pointers to use completely different code for
>> different CPU versions and brands. The optimal strategy for a Core2 is
>> probably not the same as for a Hammer, a Xeon or a Crusoe, just to name
>> a few. Which is why such code should really be provided by a specialist
>> vendor who can afford the time and hardware to design, tune and test on
>> every x86 and x64 CPU generation ever sold.
>>
>> Ideally, that vendor would be the Microsoft team that writes the C
>> runtime library, but this would tend to be 1 or 2 CPU generations behind
>> the times due to new CPU designs being released more often than Windows
>> versions. It could also be the CPU makers themselves by exporting these
>> functions from the processor driver.
>>
--
Jakob Bøhm, M.Sc.Eng. * j...@danware.dk * direct tel:+45-45-90-25-33
Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
http://www.netop.com * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
I wish Microsoft did one of two things: either make sure that one can
fully build a driver with the standard MSVC compiler and tools, or
else give us a real kernel-side compiler, one which generates
kernel-safe code and provides a kernel-side C library. A
driver-building project template would help a lot, and so would an
MSVC-integrated remote driver loader utility and a WinDbg that's fully
integrated with MSVC.
I wish I could develop drivers with the same comfort I develop my C#
designer-based application code. Have a bunch of kernel side
components accessible to the Designer, point, point, click, click, and
presto! A driver. I'm definitely not a cmd box or xterm dude, and I
really dislike build.exe. It takes me back to the seventies all over
again.
Alberto.
You can prove that by checking the version number. However, compilers
aren't much interested in luck. What, exactly, caused you trouble?
>I wish Microsoft did one of two things, either make sure that one can
>fully build a driver with the standard MSVC compiler and tools, or
>else give us a real kernel-side compiler, which generates kernel-safe
>code and which provides a kernel-side C library.
Why do you think they don't currently do that?
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.