Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

RegNotifyChangeKeyValue and ipc?

8 views
Skip to first unread message

L.Allan

unread,
Oct 23, 2008, 8:38:31 AM10/23/08
to
I'm planning to implement a "master/parent" process that creates several
"child/slave" processes (from 1 to 10+ children).

Each process displays a different but similar document document. When the
parent process does a search for a certain word(s) with certain criteria
(case-sensitive, exact match, range, etc), I want all the child processes to
do the same search on their document and display "match hits". When a child
process does a search, I don't want the parent and other children to do that
search.

My experience is that processes are easier to get to multi-thread with a
multi-core cpu that having an MDI-like app with multiple threads. (or the
somewhat newer Multiple-Top-Level-Document architecture).

I'm considering communicating from the parent to the children with
RegNotifyChangeKeyValue as the means of inter-process-communication (ipc).
I've tried a simple POC (proof of concept) and it seems to work well. Each
of the children are blocked "listening" for a registry key to change. The
contents of the registry key will have a string with the search criteria,
such as
1 0 0 0 1 12 65 "words to search"

I've been able to get this to work with communication using defined
PostMessages and also with Events. However, for the intended purpose, using
RegNotifyChangeKeyValue seems simpler. Also, it seems to scale well, and
doesn't have to be concerned about child processes "coming and going" if one
or more are closed by the end-user, or "fresh/new" children are created.

It seems kind of the equivalent of a BroadCast, but limited only to the
child processes that are "listening" to a specific sub-hive. My impression
is that a PostMessage(BROADCAST) is inappropriate for several reasons,
including that many/all processes have to "wake up" to see if the broadcast
message is for them.

I'd appreciate hearing information on the pro's and con's of this approach.
I haven't done high resolution timings to compare to the PostMessage
approach or Event approach. Each of the searches take from 15 to 50 ms per
cpu-core on a relatively modern computer, so I don't think the overhead is
too much of a factor (probably several microseconds).

Joseph M. Newcomer

unread,
Oct 23, 2008, 10:07:40 AM10/23/08
to
See below...

On Thu, 23 Oct 2008 06:38:31 -0600, "L.Allan" <lynn.d...@gmail.com> wrote:

>I'm planning to implement a "master/parent" process that creates several
>"child/slave" processes (from 1 to 10+ children).
>
>Each process displays a different but similar document document. When the
>parent process does a search for a certain word(s) with certain criteria
>(case-sensitive, exact match, range, etc), I want all the child processes to
>do the same search on their document and display "match hits". When a child
>process does a search, I don't want the parent and other children to do that
>search.
>
>My experience is that processes are easier to get to multi-thread with a
>multi-core cpu that having an MDI-like app with multiple threads. (or the
>somewhat newer Multiple-Top-Level-Document architecture).

****
Threading is threading, and the scheduler does not care which process owns the threads.
However, it is effectively slower if you have multiple processes because the process swap
cost is more expensive, whereas if the thread about to be scheduled is in the same process
as the previous thread there is no process-swap cost.
****


>
>I'm considering communicating from the parent to the children with
>RegNotifyChangeKeyValue as the means of inter-process-communication (ipc).
>I've tried a simple POC (proof of concept) and it seems to work well. Each
>of the children are blocked "listening" for a registry key to change. The
>contents of the registry key will have a string with the search criteria,
>such as
>1 0 0 0 1 12 65 "words to search"

****
Probably the worst possible way I could imagine to communicate; even worse than writing a
shared file. Don't even bother thinking any further about this as a way of communication.

It just doesn't make sense to block a process. Since you create the programs, you have
the handle to the process, you can trivially find the window from this, and you can
SendMessage(WM_COPYDATA) or use any other IPC mechanism that makes sense to handle this.
Consider a model that blocks waiting for a message (other than in the top-level message
loop) to be a Fundamentally Bad Design.
****


>
>I've been able to get this to work with communication using defined
>PostMessages and also with Events. However, for the intended purpose, using
>RegNotifyChangeKeyValue seems simpler.

****
I don't see how. It involves a wait which can block indefinitely, and it also has
problems such as dealing with the Registry. I wouldn't waste five minutes trying to
implement such a solution.
****


>Also, it seems to scale well, and
>doesn't have to be concerned about child processes "coming and going" if one
>or more are closed by the end-user, or "fresh/new" children are created.

****
The entire approach of using multiple processes seems to be the core of the problem; why
are you using multiple processes when a single process multithreaded would do the job much
more easily? The ability of the end user to close the processes seems a Really Bad Idea
anyway. Besides, as far as scaling, you rarely gain anything by scaling to more threads
than you have CPUs, and if you multithread within a single app, you have no problem
passing arbitrarily complex data structures between threads, because you are in the same
address space.

I'd create an I/O Completion Port, N search threads where N is the number of CPUs you
have, and queue up requests to the threads via the IOCP. This would take less effort than
trying to coordinate independent processes and use obscure and complex interprocess
communication such as waiting for a registry key to change. Let's see: using threads is
easier, runs faster, scales better, minimizes overheads, reduces complexity...what's not
to like about it?
****


>
>It seems kind of the equivalent of a BroadCast, but limited only to the
>child processes that are "listening" to a specific sub-hive. My impression
>is that a PostMessage(BROADCAST) is inappropriate for several reasons,
>including that many/all processes have to "wake up" to see if the broadcast
>message is for them.

****
Don't waste time on this solution. It does NOT "scale well", because if you create more
processes than you have CPUs, you actually REDUCE performance.
****


>
>I'd appreciate hearing information on the pro's and con's of this approach.
>I haven't done high resolution timings to compare to the PostMessage
>approach or Event approach. Each of the searches take from 15 to 50 ms per
>cpu-core on a relatively modern computer, so I don't think the overhead is
>too much of a factor (probably several microseconds).

****
PostMessage should be faster. Events should not be used at all. Process startup is
expensive. The communication mechanism is unnecessarily complex. I don't see anything
good about the proposed solution at all.
joe
****
>
>
Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

L.Allan

unread,
Oct 23, 2008, 11:16:47 AM10/23/08
to
Thanks for the reply and your direct thoughts.

My thinking would be that a worker or UI thread within the children/slave
processes would be listening for the RegNotifyChangeKey, and would notify
the main thread when received. That thread would be blocked, but not the
main thread. The main thread would continue to be responsive to the
end-user.

I'm certainly not all that experienced and by no means an expert on
multi-threading. My experience has been that it is difficult to get more
than a 60% speedup in a single process with multiple threads. I've seen that
non-scaling 60% speedup with a dual core, quad core, and an Intel SDV with
dual quad cores, and using IPC mechanisms such as PostMessage, SendMessage,
Events, critical sections, etc. That was part of why I was exploring the
multiple processes.

My test results with multi-threading were that if a single core could get
one thousand search loops of a 4MB file in a memory buffer done in 10
seconds, then two cores each doing search loops of two different 4MB
files/buffers would get about 1600 loops done in 10 seconds. I consider that
a so-so 60% speedup.

But a quad core doing search loops of four different 4MB files/buffers got a
total of only about 1700 search loops in 10 seconds, which was very
disappointing. I would expect a quad core to get a total of 3500 to 3900
search loops done (2% to 10% overhead). I'll acknowledge that someone who
knew what they were doing might get 300% to 400% speed up with a quad-core.

My speculation is that cache contention of relatively 4MB memory buffers
might be part of the explanation for the poor scaling with a single process
with four threads. However, when I used 4 separate processes to do the
search-loops of 4 different 4MB files in memory buffers, the quad-core got
over 3900 search loops done in 10 seconds ... very close to 400% scaling. I
tried smaller 1MB memory buffers with the single-process with 4 threads, but
that didn't help.

Note that for these timings were very repeatable, the system was basically
idle while the tests were running, and the processes were running at
ABOVE_AVERAGE_PRIORITY_CLASS.

During the ten seconds of the single-process with 4 threads, the TaskManager
showed a consistent 25% overall CPU usage. With 4 processes, the CPU meter
was "pegged".

I am aware, but ignorant about the IO completion port approach. I'll
investigate further.

"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:kk01g45o3fvj9drhq...@4ax.com...

Joseph M. Newcomer

unread,
Oct 23, 2008, 11:50:06 PM10/23/08
to
See below...

On Thu, 23 Oct 2008 09:16:47 -0600, "L.Allan" <lynn.d...@gmail.com> wrote:

>Thanks for the reply and your direct thoughts.
>
>My thinking would be that a worker or UI thread within the children/slave
>processes would be listening for the RegNotifyChangeKey, and would notify
>the main thread when received. That thread would be blocked, but not the
>main thread. The main thread would continue to be responsive to the
>end-user.

****
But this is a bizarre and baroque mechanism to use when there are simpler mechanisms
available, particularly mechanisms that scale well and have lower overhead, simpler code,
etc.
****


>
>I'm certainly not all that experienced and by no means an expert on
>multi-threading. My experience has been that it is difficult to get more
>than a 60% speedup in a single process with multiple threads. I've seen that
>non-scaling 60% speedup with a dual core, quad core, and an Intel SDV with
>dual quad cores, and using IPC mechanisms such as PostMessage, SendMessage,
>Events, critical sections, etc. That was part of why I was exploring the
>multiple processes.

****
You will get less than that with multiple processes. Remember, processes have NOTHING to
do with scheduling; the scheduler ONLY schedules threads, and the only role the process
serves is to tell the scheduler what memory map to load. When you load the memory map,
you lose things like the cached Translation Lookaside Buffer entries (TLBs), so not only
do you have additional overhead to load the map, it invalidates the TLB which gives
ANOTHER reduction in performance. If you have threads in the same process that are
competing directly, the process state will not be reloaded when the thread is scheduled.

Why would you think that adding overhead could make a multithreaded system run faster?
****


>
>My test results with multi-threading were that if a single core could get
>one thousand search loops of a 4MB file in a memory buffer done in 10
>seconds, then two cores each doing search loops of two different 4MB
>files/buffers would get about 1600 loops done in 10 seconds. I consider that
>a so-so 60% speedup.

****
Feels about right; your bottleneck is not the threading but the disk access to read the
files!
****


>
>But a quad core doing search loops of four different 4MB files/buffers got a
>total of only about 1700 search loops in 10 seconds, which was very
>disappointing. I would expect a quad core to get a total of 3500 to 3900
>search loops done (2% to 10% overhead). I'll acknowledge that someone who
>knew what they were doing might get 300% to 400% speed up with a quad-core.

****
But since the performance bottleneck is the disk, why do you think adding more cores is
going to make your disk run faster? Remember that your disk is at a MINIMUM of 5 orders
of magnitude slower than memory, and at the worst case can be 7 orders of magnitude
slower. The threading overheads are not your problem. So you are introducing a complex
solution (using processes) and a baroque ipc mechanism (waiting for registry updates) to
"improve" the efficiency of a task that is limited entirely by your disk access time; in
fact, it is distinctly a victim of the disk accesses. If disk access were instantaneous,
that would be a different issue, but with 5-7 orders of magnitude performance difference,
it is almost certainly the performance bottleneck.
****


>
>My speculation is that cache contention of relatively 4MB memory buffers
>might be part of the explanation for the poor scaling with a single process
>with four threads. However, when I used 4 separate processes to do the
>search-loops of 4 different 4MB files in memory buffers, the quad-core got
>over 3900 search loops done in 10 seconds ... very close to 400% scaling. I
>tried smaller 1MB memory buffers with the single-process with 4 threads, but
>that didn't help.

****
It isn't even in the noise of the computation. A half-revolution delay in getting the
disk data on a 7200rpm drive is 70us (assuming the head is in position), and the cycle
time of a 2.8GHz processor is 350ps, so the statistical average delay is 5 orders of
magnitude slower than the CPU. So threading is not your problem. And creating more
processes only creates problems, it doesn't solve them.
****


>
>Note that for these timings were very repeatable, the system was basically
>idle while the tests were running, and the processes were running at
>ABOVE_AVERAGE_PRIORITY_CLASS.

****
And the disk can only deliver data at a certain rate, no matter what the priority of the
threads might be. So the threads end up waiting the same amount of time for data to
arrive, even if they are running at priority 15.

I/O is *always* the bottleneck, which is why doing clever things to optimize large data
sets is rarely valuable. Instead of spending 99% of the time waiting for data, an
optimized might program might spend 98% of its time waiting.

And, for large datasets, you have not taken cache thrashing into account; cache
optimization can buy you an order of magnitude easily. Creating processes instead of
threads truly trashes performance because of the TLB invalidation, but even with threads,
you can assume that you can lose an order of magnitude performance to cache thrashing, and
creating multiple processes has ZERO improvement on that (and could make it worse!)
****


>
>During the ten seconds of the single-process with 4 threads, the TaskManager
>showed a consistent 25% overall CPU usage. With 4 processes, the CPU meter
>was "pegged".

****
Don't forget that if the CPU is busy handling I/O requests in the kernel, it is considered
"busy" from the viewpoint of the task manager, because it is not running the idle loop.
But it is not necessarily running your search code. So you could be bottlenecked just
getting the data in. *Always* suspect the data pipe with the lowest performance, which in
this case is the disk. Measure, measure, measure. You are committing the worst possible
sin in "optimizing" performance, which is to assume you know where the time is going and
consequently spend time solving a problem that may be entirely a figment of your
imagination.

Example: some years ago, we were having a performance problem with our compiler. Yet
nobody could find the "hotspot" using my performance tool. So I did what any good
engineer would do: I studied the actual problem. And as I was single-stepping through the
read loop, I noticed it did one kernel call per byte! I changed the code to read in a
thousand bytes at a time (literally: 200 words at 5 bytes each), and we got a *huge*
performance improvement.

Until you know where the time is going, creating complex solutions that solve an opinion
that (a) you know it is in the threading and (b) creating more processes makes the
scheduler more efficient is not good engineering.

"If you can't express it in numbers, it ain't science, it's opinion" (Robert A. Heinlein,
writing as his character "Lazarus Long")

Note also that the performance bottleneck could also be in the storage allocator if you
are hitting it hard.

I presume you are measuring an optimized release version and not a debug version (debug
versions can cost you at least an order of magnitude performance, and sometimes much more;
one NG poster was complaining that he could only add abut 5,000 elements/second to his
array, and that was true in the debug version; in the release version, with a GrowBy
parameter set to 500,000, I could add 12,000,000 elements/second, so the debug version
lost three orders of magnitude performance over the release version and a one-line twiddle
of the code)

You have no data. Why are you trying to optimize anything?
****


>
>I am aware, but ignorant about the IO completion port approach. I'll
>investigate further.

****
See my essay.

L.Allan

unread,
Oct 24, 2008, 3:27:42 PM10/24/08
to
The data is all in four sets of 4MB memory buffers (originally loaded from a
file ... one memory buffer per file). There is no file access going on
during the timings. File loading into the memory buffers is part of
initialization for the test.

A quad-core is getting about 60% speedup with one process+4 threads.

A quad-core with 4 processes is getting about 390% speedup.

"Joseph M. Newcomer" <newc...@flounder.com> wrote in message

news:f1g2g45furq271sa8...@4ax.com...

Joseph M. Newcomer

unread,
Oct 24, 2008, 10:34:48 PM10/24/08
to
See below...

On Fri, 24 Oct 2008 13:27:42 -0600, "L.Allan" <lynn.d...@gmail.com> wrote:

>The data is all in four sets of 4MB memory buffers (originally loaded from a
>file ... one memory buffer per file). There is no file access going on
>during the timings. File loading into the memory buffers is part of
>initialization for the test.
>
>A quad-core is getting about 60% speedup with one process+4 threads.
>
>A quad-core with 4 processes is getting about 390% speedup.

****
4MB is such a trivial amount of space that it should not impact overall behavior. You had
not said that the data is preloaded. When I get time, I'll see what my performance tools
tell me by running a similar example. It will probably take me a week to get around to
this.

0 new messages