I've tried RtlCopyMemory and it seems extremely slow. I'm periodically
copying 4096 ULONGs from an output buffer associated with a pended IRP
to an output buffer associated with a synchronous request, but
performance is dismal. It seems as if each DWORD is being copied one by
one; the data rate is effectively the same as if I picked up every
DWORD individually from the hardware device. Is the WdfMemoryCopy...
combination any better, or is it just a wrapper around RtlCopyMemory?
If they are effectively the same, would it be better to have the
service owning the source buffer (an indefinitely pended IRP) declare a
section object? Can I avoid copying altogether with that solution? Will
a section object work at all if the source buffer is pended (and
therefore locked in memory), i.e. will it be able to use (or does it
need) the system paging file as backing store?
Thanks in advance for any tips.
Charles
From ordinary memory to ordinary memory, or is on-device memory involved?
--
Maxim S. Shatskih
Windows DDK MVP
ma...@storagecraft.com
http://www.storagecraft.com
--
This posting is provided "AS IS" with no warranties, and confers no rights.
"Charles Gardiner" <inv...@invalid.invalid> wrote in message
news:76qcsqF...@mid.individual.net...
> Thanks in advance for any tips you might have.
Maybe.
> but I don't think so. It really seems as if the RtlCopyMemory
> is moving DWORD for DWORD instead of doing a block copy (at least the
No, RtlCopyMemory is memcpy.
You need some serious profiler to find the cause of the slowdown.
Unless you are compiling your driver for 64-bit Itanium (with an old
DDK) or 64-bit Alpha AXP (never released), RtlCopyMemory() is an alias
for memcpy(), which is inlined by the C compiler to code that operates
one ULONG_PTR at a time (a rep movs instruction and some alignment code
on i386 and x64).
In contrast, RtlMoveMemory() is an alias for memmove() which the
compiler does not inline, and which wastes a few cycles on a call
instruction and if() statements to test for memory range overlap.
But after that initial overhead, RtlMoveMemory() happens to use
a more intelligent (and larger) copying loop which may make it faster
than the inlined rep movs instruction produced by RtlCopyMemory().
Bizarre but true...
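For reference, the WDK headers define the two essentially like this
(paraphrased from wdm.h; check the headers in your own kit):

/* Paraphrased from the WDK headers -- not something to paste into a
   driver, just what the two macros expand to: */
#define RtlCopyMemory(Destination, Source, Length) \
    memcpy((Destination), (Source), (Length))
#define RtlMoveMemory(Destination, Source, Length) \
    memmove((Destination), (Source), (Length))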
--
Jakob Bøhm, M.Sc.Eng. * j...@danware.dk * direct tel:+45-45-90-25-33
Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
http://www.netop.com * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
Information in this mail is hasty, not binding and may not be right.
Information in this posting may not be the official position of Netop
Solutions A/S, only the personal opinions of the author.
> In contrast, RtlMoveMemory() is an alias for memmove() which the
> compiler does not inline, and which wastes a few cycles on a call
> instruction and if() statements to test for memory range overlap.
> But after that initial overhead, RtlMoveMemory() happens to use
> a more intelligent (and larger) copying loop which may make it faster
> than the inlined rep movs instruction produced by RtlCopyMemory().
>
> Bizarre but true...
>
>
Just for info, the WDK docs compare these two functions in exactly
the opposite way. Something to the effect of "RtlCopyMemory runs
faster, but the two regions may not overlap".
Can you create a fake driver that performs all the tasks *but* the
memcpy, and see what the performance is in that case?
Have a nice day
GV
"Charles Gardiner" <inv...@invalid.invalid> wrote in message
news:76r6ddF...@mid.individual.net...
It's only when I run the demo app that I get the bottleneck. The app is
very simple CodeGear stuff, just picking up 4096 ULONGs every 20 ms or
so and writing the hex values to a listbox.
I've since added a second thread for the reads from the circular buffer
(the RtlCopyMemory initiator). The app runs a bit more smoothly, but
I'm still getting a system load of about 60% (Pentium
something-or-other with ICH7 chipset), which I find quite high.
Gianluca Varenni wrote:
Uhm.... I think the culprit is the listbox. Updating a listbox 50
times a second with 4096 values seems quite bad to me...
I bet that if you keep your code that picks up the ULONGs but disable
the update of the listbox, your application will run smoothly.
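Something like this usually helps (plain Win32, an untested sketch;
the CodeGear VCL wrapper will look different, and hList/values are
placeholders for the app's own handle and data):

#include <windows.h>

/* Suppress repainting while the 4096 items are replaced, then force
   one repaint at the end instead of 4096 of them. */
void UpdateListBox(HWND hList, const ULONG *values, int count)
{
    int i;
    char text[16];
    SendMessageA(hList, WM_SETREDRAW, FALSE, 0);
    SendMessageA(hList, LB_RESETCONTENT, 0, 0);
    for (i = 0; i < count; i++) {
        wsprintfA(text, "%08lX", values[i]);
        SendMessageA(hList, LB_ADDSTRING, 0, (LPARAM)text);
    }
    SendMessageA(hList, WM_SETREDRAW, TRUE, 0);
    InvalidateRect(hList, NULL, TRUE);
}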
GV
Well this would be true if the compiler did not inline memcpy() or if
the compiler inlined both memcpy() and memmove(). In practice, the
compiler inlines memcpy() but not memmove() and the inline versions of
both functions are optimized for small copy sizes while the out-of-line
versions are optimized for medium copy sizes. A version optimized for
really large sizes (ouch) would include additional SSE/MMX instructions
to do prefetching and other CPU cache management and would be specific
to each CPU generation and brand.
#include <xmmintrin.h>

/* Copy 'size' 16-byte blocks; both pointers must be 16-byte aligned. */
void moveto(__m128 *to, __m128 *from, int size)
{
    while (size-- > 0)
    {
        *to++ = *from++;
    }
}
There's some additional logic to handle moves that don't align to 128
bits; I also unroll the loop in an attempt to improve speed, and I
have a try/except block to catch errors. This works both on 32-bit
and on 64-bit. One advantage is that you have full control over
the amount of unrolling, and if you feel adventurous you can go the
extra mile and try one of the more esoteric data movement techniques
that Intel suggested in their tech papers when the xmm instruction set
was first announced.
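(For illustration only, an untested sketch of one such technique:
SSE2 non-temporal stores, which keep the destination out of the CPU
cache. Assumes 16-byte-aligned pointers and a whole number of 16-byte
blocks.)

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Copy size16 16-byte blocks using streaming stores; the destination
   bypasses the cache, which can win for buffers larger than the cache. */
void moveto_nt(__m128i *to, const __m128i *from, int size16)
{
    while (size16-- > 0)
        _mm_stream_si128(to++, _mm_load_si128(from++));
    _mm_sfence();   /* make the streamed stores globally visible */
}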
I checked the code that the compiler generated, and it looks pretty
decent. I don't know if this saves any significant amount of execution
time because my environment is so chip-bound that optimizing processor
performance doesn't improve the throughput one iota. And, of course,
depending on what you're trying to do, you may want to save/restore
some floating point state.
Alberto.
On May 12, 6:42 am, Jakob Bohm <j...@danware.dk> wrote:
> Charles Gardiner wrote:
> > Jakob Bohm schrieb:
>
> >> In contrast, RtlMoveMemory() is an alias for memmove() which the
> >> compiler does not inline, and which wastes a few cycles on a call
> >> instruction and if() statements to test for memory range overlap.
> >> But after that initial overhead, RtlMoveMemory() happens to use
> >> a more intelligent (and larger) copying loop which may make it faster
> >> than the inlined rep movs instruction produced by RtlCopyMemory().
>
> >> Bizarre but true...
>
> > Just for info, the WDK docs compare these two instructions in exactly
> > the opposite way. Something to the effect of "RtlCopyMemory runs faster
> > but the two regions may not overlap"
>
> Well this would be true if the compiler did not inline memcpy() or if
> the compiler inlined both memcpy() and memmove(). In practice, the
> compiler inlines memcpy() but not memmove() and the inline versions of
> both functions are optimized for small copy sizes while the out-of-line
> versions are optimized for medium copy sizes. A version optimized for
> really large sizes (ouch) would include additional SSE/MMX instructions
> to do prefetching and other CPU cache management and would be specific
> to each CPU generation and brand.
>
> --
> Jakob Bøhm, M.Sc.Eng. * j...@danware.dk * direct tel:+45-45-90-25-33
> Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
> http://www.netop.com * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
> Information in this mail is hasty, not binding and may not be right.
> Information in this posting may not be the official position of Netop
> Solutions A/S, only the personal opinions of the author.
Actually, you should be very careful about touching the floating point
state in kernel mode, although the documentation for what you may and
may not do is (or used to be) sketchy.
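On x86 the documented pattern is roughly this (a sketch only;
DoSseCopy() is a hypothetical placeholder, and the IRQL restrictions
in the KeSaveFloatingPointState documentation apply):

#include <ntddk.h>

extern void DoSseCopy(void *dst, const void *src, SIZE_T len); /* hypothetical */

/* Save the FPU/SSE state before touching it in kernel mode and
   restore it afterwards (x86; IRQL <= DISPATCH_LEVEL). */
NTSTATUS SafeSseCopy(void *dst, const void *src, SIZE_T len)
{
    KFLOAT_SAVE save;
    NTSTATUS status = KeSaveFloatingPointState(&save);
    if (!NT_SUCCESS(status))
        return status;
    DoSseCopy(dst, src, len);
    KeRestoreFloatingPointState(&save);
    return STATUS_SUCCESS;
}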
Also, such optimized memcpy code needs to be protected by if
statements or function pointers to use completely different code for
different CPU versions and brands. The optimal strategy for a Core2 is
probably not the same as for a Hammer, a Xeon or a Crusoe, just to name
a few. Which is why such code should really be provided by a specialist
vendor who can afford the time and hardware to design, tune and test on
every x86 and x64 CPU generation ever sold.
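The function-pointer variant might look like this (a sketch; every
Copy* routine and DetectCpuVariant() are hypothetical stand-ins for
the per-microarchitecture tuned versions):

#include <stddef.h>

typedef void (*COPY_FN)(void *dst, const void *src, size_t len);

extern void CopyRepMovs(void *dst, const void *src, size_t len); /* generic fallback */
extern void CopyCore2(void *dst, const void *src, size_t len);   /* Core2-tuned */
extern void CopyHammer(void *dst, const void *src, size_t len);  /* K8-tuned */
extern int  DetectCpuVariant(void);  /* hypothetical CPUID-based probe */

static COPY_FN g_copy = CopyRepMovs;

/* Choose the copy routine once at startup; callers then go through
   g_copy with no per-call branching. */
void InitCopyDispatch(void)
{
    switch (DetectCpuVariant()) {
    case 1:  g_copy = CopyCore2;  break;
    case 2:  g_copy = CopyHammer; break;
    default: g_copy = CopyRepMovs; break;  /* unknown CPU: generic code */
    }
}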
Ideally, that vendor would be the Microsoft team that writes the C
runtime library, but this would tend to be 1 or 2 CPU generations behind
the times due to new CPU designs being released more often than Windows
versions. It could also be the CPU makers themselves by exporting these
functions from the processor driver.
--
Jakob Bøhm, M.Sc.Eng. * j...@danware.dk * direct tel:+45-45-90-25-33
Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
http://www.netop.com * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
Yet here I'm moving 128 bits at a time, relying on a data type
provided to me by the DDK Compiler. I hope that the compiler handles
the difference between processor platforms, and that it generates
sensible code; yet I check every generated machine instruction. And
why did the compiler writers supply the facility if we're not supposed
to use it?
When a processor is issued, Intel or AMD usually supply plenty of
technical notes, design sheets, and other hardware-level
documentation. Those are valuable sources of enlightenment. But beyond
that, more often than not I'd rather roll my own code. If nothing
else, it allows me to go above and beyond what an API can give me.
So far, knock on wood, that specific memory move code hasn't caused
any grief.
Alberto.
On May 12, 11:05 am, Jakob Bohm <j...@danware.dk> wrote:
> Actually, you should be very careful about touching the floating point
> state in kernel mode, although the documentation for what you may and
> may not do is (or used to be) sketchy.
>
> Also, such optimized memcpy code needs to be protected by if
> statements or function pointers to use completely different code for
> different CPU versions and brands. The optimal strategy for a Core2 is
> probably not the same as for a Hammer, a Xeon or a Crusoe, just to name
> a few. Which is why such code should really be provided by a specialist
> vendor who can afford the time and hardware to design, tune and test on
> every x86 and x64 CPU generation ever sold.
>
> Ideally, that vendor would be the Microsoft team that writes the C
> runtime library, but this would tend to be 1 or 2 CPU generations behind
> the times due to new CPU designs being released more often than Windows
> versions. It could also be the CPU makers themselves by exporting these
> functions from the processor driver.
>
> --
> Jakob Bøhm, M.Sc.Eng. * j...@danware.dk * direct tel:+45-45-90-25-33
> Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
> http://www.netop.com * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
> Information in this mail is hasty, not binding and may not be right.
> Information in this posting may not be the official position of Netop
Also please note that as far as I recall, OpenGL ICDs are treated
specially with respect to floating point because this need is so common
in them.
> Yet here I'm moving 128 bits at a time, relying on a data type
> provided to me by the DDK Compiler. I hope that the compiler handles
> the difference between processor platforms, and that it generates
> sensible code; yet I check every generated machine instruction. And
> why did the compiler writers supply the facility if we're not supposed
> to use it ?
>
The DDK Compiler is the very same as the user mode compiler used for
compiling the user mode part of that NT release. Not every facility
found in that compiler (it also supports C++ exceptions for instance) is
designed to run in kernel mode, and not every advanced operation or
intrinsic is backward compatible with every supported CPU. For
instance, the SSE2 intrinsics are supposed to be used together with
conditional code that checks whether there is SSE2 on the end user's
CPU (and then uses another solution if there is not).
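In a driver that check can be done with ExIsProcessorFeaturePresent();
a sketch, with Sse2Copy() as a hypothetical stand-in for the SSE2
path:

#include <ntddk.h>

extern void Sse2Copy(void *dst, const void *src, SIZE_T len); /* hypothetical */

/* Take the SSE2 path only when the kernel reports SSE2 support,
   otherwise fall back to plain RtlCopyMemory. */
void CopyBuffer(void *dst, const void *src, SIZE_T len)
{
    if (ExIsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE))
        Sse2Copy(dst, src, len);
    else
        RtlCopyMemory(dst, src, len);
}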
> When a processor is issued, Intel or Amd usually supply plenty of
> technical notes, design sheets, and other hardware level
> documentation. Those are valuable sources of enlightment. But beyond
> that, more often than not I'd rather roll my own code. If nothing
> else, it allows me to go above and beyond what an API can give me.
Me too, but it can get tedious to maintain a collection of 5 or 10
memcpy() implementations for different CPU microarchitectures.
While prototype implementations of such functions can usually be cribbed
from the pages of those technical notes, the code in those notes tends
to be conceptual and not tested to the level expected of a frequently
used CRT function.
>
> So far, knock on wood, that specific memory move code didn't cause any
> grief.
>
>
> Alberto.
>
>
>
> On May 12, 11:05 am, Jakob Bohm <j...@danware.dk> wrote:
>
>> Actually, you should be very careful about touching the floating point
>> state in kernel mode, although the documentation for what you may and
>> may not do is (or used to be) sketchy.
>>
>> Also, such optimized memcpy code needs to be protected by if
>> statements or function pointers to use completely different code for
>> different CPU versions and brands. The optimal strategy for a Core2 is
>> probably not the same as for a Hammer, a Xeon or a Crusoe, just to name
>> a few. Which is why such code should really be provided by a specialist
>> vendor who can afford the time and hardware to design, tune and test on
>> every x86 and x64 CPU generation ever sold.
>>
>> Ideally, that vendor would be the Microsoft team that writes the C
>> runtime library, but this would tend to be 1 or 2 CPU generations behind
>> the times due to new CPU designs being released more often than Windows
>> versions. It could also be the CPU makers themselves by exporting these
>> functions from the processor driver.
>>
--
Jakob Bøhm, M.Sc.Eng. * j...@danware.dk * direct tel:+45-45-90-25-33
Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
http://www.netop.com * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
I wish Microsoft did one of two things: either make sure that one can
fully build a driver with the standard MSVC compiler and tools, or
else give us a real kernel-side compiler, one which generates
kernel-safe code and provides a kernel-side C library. A
driver-building project template would help a lot, and so would an
MSVC-integrated remote driver loader utility and a WinDbg that's fully
integrated with MSVC.
I wish I could develop drivers with the same comfort I develop my C#
designer-based application code. Have a bunch of kernel side
components accessible to the Designer, point, point, click, click, and
presto! A driver. I'm definitely not a cmd box or xterm dude, and I
really dislike build.exe. It takes me back to the seventies all over
again.
Alberto.
You can prove that by checking the version number. However, compilers
aren't much interested in luck. What, exactly, caused you trouble?
>I wish Microsoft did one of two things, either make sure that one can
>fully build a driver with the standard MSVC compiler and tools, or
>else give us a real kernel-side compiler, which generates kernel-safe
>code and which provides a kernel-side C library.
Why do you think they don't currently do that?
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.