
Can extra processing threads help in this case?


Peter Olcott

Mar 21, 2010, 2:19:34 PM
I have an application that uses enormous amounts of RAM in a
very memory-bandwidth-intensive way. I recently upgraded my
hardware to a machine with 600% faster RAM and 32-fold more
L3 cache. This L3 cache is also twice as fast as the prior
machine's cache. When I benchmarked my application across the
two machines, I gained an 800% improvement in wall clock
time. The new machine's CPU is only 11% faster than the prior
machine's. Both processes were tested on a single CPU.

I am thinking that all of the above would tend to show that
my process is very memory bandwidth intensive, and thus
could not benefit from multiple threads on the same machine
because the bottleneck is memory bandwidth rather than CPU
cycles. Is this analysis correct?
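The question is directly testable with a small benchmark: scan a buffer much larger than any cache with one thread, then with N threads each scanning a private slice, and compare wall-clock times. A minimal sketch using standard C++ threads (all names and sizes here are illustrative, not from the thread):

```cpp
#include <chrono>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sum a slice of the buffer; returning the sum keeps the
// compiler from optimizing the reads away.
static uint64_t scan(const uint32_t* data, size_t count) {
    uint64_t sum = 0;
    for (size_t i = 0; i < count; ++i) sum += data[i];
    return sum;
}

// Scan the whole buffer with nthreads threads; returns elapsed ms
// and writes the combined checksum to *total.
static double timed_scan(const std::vector<uint32_t>& buf, unsigned nthreads,
                         uint64_t* total) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    std::vector<uint64_t> sums(nthreads, 0);
    size_t slice = buf.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        size_t begin = t * slice;
        // last thread takes any remainder so every element is scanned once
        size_t count = (t == nthreads - 1) ? buf.size() - begin : slice;
        pool.emplace_back([&, t, begin, count] {
            sums[t] = scan(buf.data() + begin, count);
        });
    }
    for (auto& th : pool) th.join();
    *total = std::accumulate(sums.begin(), sums.end(), uint64_t{0});
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

If the 2-thread time is close to the 1-thread time, the cores are sharing bandwidth well; if the per-thread times roughly double, the scan really is bandwidth-bound.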


Hector Santos

Mar 21, 2010, 3:41:39 PM
Geez, and here I was hoping you would get your "second opinion" from a
more appropriate forum like:

microsoft.public.win32.programmer.kernel

or one of the performance forums.

Peter Olcott wrote:

> I have an application that uses enormous amounts of RAM in a
> very memory bandwidth intensive way.


How do you do this?

How much memory is the process loading?

Show the code that demonstrates how intensive this is. Is it blocking on memory access?

I recently upgraded my
> hardware to a machine with 600% faster RAM and 32-fold more
> L3 cache. This L3 cache is also twice as fast as the prior
> machines cache.


What kind of CPU? Intel, AMD?

If Intel, what kind of INTEL chips are you using?

> When I benchmarked my application across the
> two machines, I gained an 800% improvement in wall clock

> time. The new machines CPU is only 11% faster than the prior

> machine. Both processes were tested on a single CPU.


Does this make sense to anyone? Two physical machines?

> I am thinking that all of the above would tend to show that
> my process is very memory bandwidth intensive, and thus
> could not benefit from multiple threads on the same machine
> because the bottleneck is memory bandwidth rather than CPU
> cycles. Is this analysis correct?

No.

But if you believe your application has reached its optimal design
point and cannot be improved by better machine performance, then you
probably wasted money upgrading your machine, which will provide you
no scalability benefits.

At best, it will allow you to do your email and web browsing and
multi-task other things while your application is chunking along at
100%.

--
HLS

Joseph M. Newcomer

Mar 21, 2010, 4:25:29 PM
Note that in the i7 architecture the L3 cache is shared across all
CPUs, so you are less likely to be hit by raw memory bandwidth (which,
compared to a CPU, is dead-slow); whether multiple threads will work
effectively can only be determined by measurement of a multithreaded
app.

Because your logic seems to indicate that raw memory speed is the
limiting factor, and you have not accounted for the effects of a
shared L3 cache, any opinion you offer on what is going to happen is
meaningless. In fact, any opinion about performance is by definition
meaningless; only actual measurements represent facts ("If you can't
express it in numbers, it ain't science, it's opinion" -- Robert A.
Heinlein)

More below...


On Sun, 21 Mar 2010 13:19:34 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>I have an application that uses enormous amounts of RAM in a
>very memory bandwidth intensive way. I recently upgraded my
>hardware to a machine with 600% faster RAM and 32-fold more
>L3 cache. This L3 cache is also twice as fast as the prior
>machines cache. When I benchmarked my application across the
>two machines, I gained an 800% improvement in wall clock
>time. The new machines CPU is only 11% faster than the prior
>machine. Both processes were tested on a single CPU.

***
The question is whether you are measuring multiple threads in a single executable image
across multiple cores, or multiple executable images on a single core. Not sure how you
know that both processes were tested on a single CPU, since you don't mention how you
accomplished this (there are several techniques, but it is important to know which one you
used, since each has its own implications for predicting overall behavior of a system).
****


>
>I am thinking that all of the above would tend to show that
>my process is very memory bandwidth intensive, and thus
>could not benefit from multiple threads on the same machine
>because the bottleneck is memory bandwidth rather than CPU
>cycles. Is this analysis correct?

****
Nonsense! You have no idea what is going on here! The shared L3 cache
could completely wipe out the memory performance issue, reducing your
problem to a cache-performance issue. Since you have not conducted the
experiment with multiple threads, you have no data to indicate one way
or the other what is going on, and it is the particular memory access
patterns of YOUR app that matter; therefore, nobody can offer a
meaningful estimate based on your L1/L2/L3 cache accesses, whatever
they may be.
joe
****
>
Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

Joseph M. Newcomer

Mar 21, 2010, 4:38:58 PM
More below...

On Sun, 21 Mar 2010 15:41:39 -0400, Hector Santos <sant...@nospam.gmail.com> wrote:

>Geez, and here I was hoping you would get your "second opinion" from a
>more appropriate forum llke:
>
> microsoft.public.win32.programmer.kernel
>
>or one of the performance forums.
>
>Peter Olcott wrote:
>
>> I have an application that uses enormous amounts of RAM in a
>> very memory bandwidth intensive way.
>
>
>How do you do this?
>
>How much memory is the process loading?
>
>Show code that shows how intensive this is. Is it blocking memory access?
>
> I recently upgraded my
>> hardware to a machine with 600% faster RAM and 32-fold more
>> L3 cache. This L3 cache is also twice as fast as the prior
>> machines cache.
>
>
>What kind of CPU? Intel, AMD?

***
Actually, he said it is an i7 architecture some hundreds of messages ago....
****


>
>If Intel, what kind of INTEL chips are you using?
>
>> When I benchmarked my application across the
>> two machines, I gained an 800% improvement in wall clock
>
> > time. The new machines CPU is only 11% faster than the prior

****
Based on what metric? Certainly, I hope you are not using clock speed,
which is known to be irrelevant to performance. Did you look at the
size of the i-pipe microinstruction cache on the two architectures?
Did you look at the amount of concurrency in the execution engine
(CPUs since 1991 have NOT executed instructions sequentially; they
just maintain the illusion that they do)? What about the new branch
predictor in the i7 architecture? CPU clock time is only comparable
within a chipset family. It bears no relationship to another chipset
family, particularly an older model, since most of the improvements
come in the instruction and data pipelines, cache management (why do
you think there is now an L3 cache in the i7s?) and other microaspects
of the architecture. And if you used a "benchmark" program to
ascertain this nominal 11% improvement, do you know what instruction
sequence was being executed when it made the measurement? Probably
not, but it turns out that's the level that matters. So how did you
arrive at this magical number, 11%?

Note also that raw memory speed doesn't matter much on real problems;
cache management is the killer of performance, and the wrong sequence
of address accesses will thrash your cache; if you are modifying data
it hurts even worse (a cache line has to be written back before it can
be reused). Caching read-only pages works well, and if you mark your
data pages as "read only" after reading them in you can improve
performance. But you are quoting performance numbers here without
giving any explanation of why you think they matter.
joe
****


>
>> machine. Both processes were tested on a single CPU.
>
>
>Does this make sense to anyone? Two physical machines?
>
>> I am thinking that all of the above would tend to show that
>> my process is very memory bandwidth intensive, and thus
>> could not benefit from multiple threads on the same machine
>> because the bottleneck is memory bandwidth rather than CPU
>> cycles. Is this analysis correct?

****
Precisely because the bottleneck appears to be memory performance, and
precisely because you have an L3 cache shared across all the chips,
you are offering meaningless opinion here. The ONLY way to figure out
what is going to happen is to run real experiments and measure what
they do. No amount of guesswork is going to tell you anything
relevant, and you are guessing when it is clear you have NO IDEA what
the implications of the i7 technology are. They are NOT just "faster
memory" or an "11% faster CPU" (whatever THAT means!). I downloaded
the Intel docs and read them while I was working on my new
multithreading course, and the i7 is more than a clock speed and a
memory speed.
joe
****


>
>no.
>
>But if you believe your application has reaches his optimal design
>point and can not do any improved for machine performance, then you
>probably wasted money on improving your machine which will provide you
>no scalability benefits.
>
>At best, it will allow you to do your email, web browser and
>multi-task to other things while your application is chunking along at
>100%.

Hector Santos

Mar 21, 2010, 8:07:01 PM
Peter Olcott wrote:

As stated numerous times, your thinking is wrong. But I don't fault
you, because you don't have the experience here; still, you should not
be ignoring what EXPERTS are telling you - especially if you have
never written multi-threaded applications.

The attached C/C++ simulation (testpeter2t.cpp) illustrates how your
single main thread process with a HUGE redundant memory access
requirement is not optimized for a multi-core/processor machine or
for any kind of scalability and performance efficiency.

Compile the attached application.

TestPeter2T.CPP will allow you to test:

Test #1 - a single main thread process
Test #2 - a multi-threads (2) process.

To run the single thread process, just run the EXE with no switches:

Here is TEST #1

V:\wc5beta> testpeter2t

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
---------------------------------------
Time: 12297 | Elapsed: 0 | Len: 0
---------------------------------------
Total Client Time: 12297

The source code is set to allocate a DWORD array with a total memory
block of 1.4 GB. I have a 2GB XP Dual Core Intel box, so one thread
should use 50% CPU.

Now this single-process test provides the natural quantum scenario
with a ProcessData() function:

void ProcessData()
{
    KIND num;
    for (int r = 0; r < repeat; r++)
        for (DWORD i = 0; i < size; i++)
            num = data[i];
}

By natural quantum, I mean there are NO "man-made" interrupts, sleeps
or yields. The OS will preempt this as naturally as it can at every
quantum.
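One caveat about a loop like ProcessData(): `num` is written but never read, so an optimizing compiler is free to delete the whole scan, in which case the timing measures nothing. A common guard is to fold the reads into a live checksum that is returned; a sketch (the function name is mine, not from the attachment):

```cpp
#include <cstddef>
#include <cstdint>

// Accumulating into a checksum that is returned forces the compiler
// to actually perform every memory access in the loop.
uint64_t process_data(const uint32_t* data, size_t size, int repeat) {
    uint64_t checksum = 0;
    for (int r = 0; r < repeat; ++r)
        for (size_t i = 0; i < size; ++i)
            checksum += data[i];
    return checksum;  // using the result keeps the loop from being elided
}
```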

If you ran TWO single process installs like so:

start testpeter2T
start testpeter2T

On my machine, running two instances seriously degraded BOTH processes
because of the HUGE virtual memory and paging requirements. The page
faults were really HIGH and it just never completed; I didn't wish to
wait because it was TOO obviously not optimized for multiple
instances. The memory load requirement was too high here.

Now comes test #2 with threads, run the EXE with the /t switch and
this will start TWO threads and here are the results:

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 13500 | Elapsed: 0 | Len: 0
1 | Time: 13016 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 26516

BEHOLD!! Scalability using a SHARED MEMORY ACCESS threaded design.

I am going to recompile the code for 4 threads by changing:

#define NUM_THREADS 4 // # of threads

Lets try it:

V:\wc5beta>testpeter2t /t
- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
- Resuming thread# 3 [000007D4] in 500 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 26078 | Elapsed: 0 | Len: 0
1 | Time: 25250 | Elapsed: 0 | Len: 0
2 | Time: 25250 | Elapsed: 0 | Len: 0
3 | Time: 24906 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 101484

So the summary so far (times are in seconds, from the millisecond
totals above):

1 thread - ~12 s
2 threads - ~13 s each
4 threads - ~25 s each

This is where you begin to look at various designs to improve things.
There are many ideas, but it requires a look at your actual workload.
We didn't use a MEMORY MAPPED FILE and that MIGHT help. I should try
that, but first let's try a 3-thread run:

#define NUM_THREADS 3 // # of threads

and recompile, run testpeter2t /t

- size : 357913941
- memory : 1431655764 (1398101K)
- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
* Wait For Thread Completion
* Done
---------------------------------------
0 | Time: 19453 | Elapsed: 0 | Len: 0
1 | Time: 13890 | Elapsed: 0 | Len: 0
2 | Time: 18688 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 52031

How interesting to see one thread get a near-best-case result!

You can normalize all of this and probably come up with a formula to
estimate what the performance will be under a request load. But this
is where WORKER POOLS and IOCP come into play, and if you are using
NUMA, the Windows NUMA API will help there too!

All in all, Peter, this proves how multithreading using shared memory
is FAR superior to your misconceived idea that your application cannot
be redesigned for a multi-core/processor machine.

I am willing to bet this simulator is far more stressful than your own
DFA/OCR application in its workload. ProcessData() here does NO WORK
at all other than accessing memory. You will not be doing this, so the
ODDS are very high that you will run much more efficiently than this
simulator.

I want to hear you say "Oh My!" <g>

--
HLS

testpeter2t.cpp

Hector Santos

Mar 21, 2010, 9:30:22 PM
Attached is "version 2" of testpeter2t.cpp, with command-line help and
more options to play with different scenarios without recompiling.

testpeter2t /?

testpeter2t [options]

/t - start 2 threads to test
/t:n - start N threads to test
/s:# - # of DWORDs in array, default creates 1.4GB bytes
/r:# - repeat memory reader loop # times (10)

No switches will start a single main thread process test

Example: start 8 threads with ~390 MB array

Testpeter2t /t:8 /s:100000000

Example result on a DUAL CORE 2GB Windows XP

- size : 100000000
- memory : 400000000 (390625K)


- repeat : 10
* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3

- Creating thread 4
- Creating thread 5
- Creating thread 6
- Creating thread 7


* Resuming threads
- Resuming thread# 0 [000007DC] in 41 msecs.
- Resuming thread# 1 [000007F4] in 467 msecs.
- Resuming thread# 2 [000007D8] in 334 msecs.
- Resuming thread# 3 [000007D4] in 500 msecs.

- Resuming thread# 4 [000007D0] in 169 msecs.
- Resuming thread# 5 [000007CC] in 724 msecs.
- Resuming thread# 6 [000007C8] in 478 msecs.
- Resuming thread# 7 [000007C4] in 358 msecs.


* Wait For Thread Completion
* Done
---------------------------------------

0 | Time: 10687 | Elapsed: 0 | Len: 0
1 | Time: 11157 | Elapsed: 0 | Len: 0
2 | Time: 11922 | Elapsed: 0 | Len: 0
3 | Time: 11984 | Elapsed: 0 | Len: 0
4 | Time: 12125 | Elapsed: 0 | Len: 0
5 | Time: 12000 | Elapsed: 0 | Len: 0
6 | Time: 11438 | Elapsed: 0 | Len: 0
7 | Time: 11313 | Elapsed: 0 | Len: 0
---------------------------------------
Total Time: 92626


--
HLS

testpeter2t.cpp

Peter Olcott

Mar 21, 2010, 10:06:20 PM

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:vmvcq55tuhj1lunc6...@4ax.com...

> NOte in the i7 architecture the L3 cache is shared across
> all CPUs, so you are less likely
> to be hit by raw memory bandwidth (which compared to a CPU
> is dead-slow), and the answer s
> so whether multiple threads will work effectively can only
> be determined by measurement of
> a multithreaded app.
>
> Because your logic seems to indicate that raw memory speed
> is the limiting factor, and you
> have not accounted for the effects of a shared L3 cache,
> any opnion you offer on what is
> going to happen is meaningless. In fact, any opinion
> about performanance is by definition
> meaningless; only actual measurements represent facts ("If
> you can't express it in
> numbers, it ain't science, it's opinion" -- Robert A.
> Heinlein)

(1) Machine A performs process B in X minutes.
(2) Machine C performs process B in X/8 Minutes (800%
faster)
(3) The only difference between machine A and machine C is
that machine C has much faster access to RAM (by whatever
means).
(4) Therefore Process B is memory bandwidth bound.

Hector Santos

Mar 21, 2010, 11:43:51 PM
Peter Olcott wrote:

>
> (1) Machine A performs process B in X minutes.
> (2) Machine C performs process B in X/8 Minutes (800%
> faster)
> (3) The only difference between machine A and machine C is
> that machine C has much faster access to RAM (by whatever
> means).
> (4) Therefore Process B is memory bandwidth bound.
>

Forget that. I just spent a few hours proving it to you with the
posted testpeter2t.cpp, which illustrates how a multi-threaded,
huge-shared-data process is superior to running multiple process
instances with redundant huge data loading.

Have you tested it yourself?

--
HLS

Peter Olcott

Mar 21, 2010, 11:50:28 PM

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:esxrhHXy...@TK2MSFTNGP06.phx.gbl...

If you can provide a valid counter-example to my reasoning above,
please do so; otherwise I will have to simply assume that you are
stubbornly wrong.


Hector Santos

Mar 22, 2010, 12:40:37 AM
Peter Olcott wrote:

You're serious? You mean you are going to ignore the PROOF I provided?

PS: The above is nonsense because you are comparing a SLOWER machine
with a FASTER machine. Your COMPARISON should be on the SAME machine.

Well, the fact that you choose to ignore the technical PROOF means you
are a FREAKING MORON!

--
HLS

Hector Santos

Mar 22, 2010, 2:54:03 AM
Here is the result using a 1.5GB read-only memory-mapped file. I
started with 1 single process thread, then switched to 2 threads, then
4, 6, 8, 10 and 12 threads. Notice how the processing time for the
earlier threads started high but decreased for the later threads. This
was the caching effect of the read-only memory-mapped file. Also note
the Global Memory Status *MEMORY LOAD* percentage. For my machine, it
is at 19% at steady state, but as expected it shoots up when dealing
with this large memory-mapped file. I probably can fine-tune the map
views better, but they are set as read-only. Well, I'll leave the OP
to figure out memory-map coding for his patented DFA meta-file
process.

V:\wc5beta>testpeter3t /s:3000000 /r:1
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 0
---------------------------------------
Time: 2984 | Elapsed: 0
---------------------------------------
Total Client Time: 2984

V:\wc5beta>testpeter3t /s:3000000 /t:2 /r:1
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 0


* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads

- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.


* Wait For Thread Completion

- Memory Load: 96%
* Done
---------------------------------------
0 | Time: 5407 | Elapsed: 0
1 | Time: 4938 | Elapsed: 0
---------------------------------------
Total Time: 10345

V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:4
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 0


* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
* Resuming threads

- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
- Resuming thread# 2 in 334 msecs.
- Resuming thread# 3 in 500 msecs.


* Wait For Thread Completion

- Memory Load: 97%
* Done
---------------------------------------
0 | Time: 6313 | Elapsed: 0
1 | Time: 5844 | Elapsed: 0
2 | Time: 5500 | Elapsed: 0
3 | Time: 5000 | Elapsed: 0
---------------------------------------
Total Time: 22657

V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:6
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 0


* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
- Creating thread 4
- Creating thread 5

* Resuming threads
- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
- Resuming thread# 2 in 334 msecs.
- Resuming thread# 3 in 500 msecs.
- Resuming thread# 4 in 169 msecs.
- Resuming thread# 5 in 724 msecs.


* Wait For Thread Completion

- Memory Load: 97%
* Done
---------------------------------------
0 | Time: 6359 | Elapsed: 0
1 | Time: 5891 | Elapsed: 0
2 | Time: 5547 | Elapsed: 0
3 | Time: 5047 | Elapsed: 0
4 | Time: 4875 | Elapsed: 0
5 | Time: 4141 | Elapsed: 0
---------------------------------------
Total Time: 31860

V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:8
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 16


* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
- Creating thread 4
- Creating thread 5
- Creating thread 6
- Creating thread 7
* Resuming threads

- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
- Resuming thread# 2 in 334 msecs.
- Resuming thread# 3 in 500 msecs.
- Resuming thread# 4 in 169 msecs.
- Resuming thread# 5 in 724 msecs.
- Resuming thread# 6 in 478 msecs.
- Resuming thread# 7 in 358 msecs.


* Wait For Thread Completion

- Memory Load: 96%
* Done
---------------------------------------
0 | Time: 6203 | Elapsed: 0
1 | Time: 5734 | Elapsed: 0
2 | Time: 5391 | Elapsed: 0
3 | Time: 4891 | Elapsed: 0
4 | Time: 4719 | Elapsed: 0
5 | Time: 3984 | Elapsed: 0
6 | Time: 3500 | Elapsed: 0
7 | Time: 3125 | Elapsed: 0
---------------------------------------
Total Time: 37547

V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:10
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 0


* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
- Creating thread 4
- Creating thread 5
- Creating thread 6
- Creating thread 7

- Creating thread 8
- Creating thread 9
* Resuming threads
- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
- Resuming thread# 2 in 334 msecs.
- Resuming thread# 3 in 500 msecs.
- Resuming thread# 4 in 169 msecs.
- Resuming thread# 5 in 724 msecs.
- Resuming thread# 6 in 478 msecs.
- Resuming thread# 7 in 358 msecs.
- Resuming thread# 8 in 962 msecs.
- Resuming thread# 9 in 464 msecs.


* Wait For Thread Completion

- Memory Load: 97%
* Done
---------------------------------------
0 | Time: 7234 | Elapsed: 0
1 | Time: 6766 | Elapsed: 0
2 | Time: 6422 | Elapsed: 0
3 | Time: 5922 | Elapsed: 0
4 | Time: 5750 | Elapsed: 0
5 | Time: 5016 | Elapsed: 0
6 | Time: 4531 | Elapsed: 0
7 | Time: 4125 | Elapsed: 0
8 | Time: 3203 | Elapsed: 0
9 | Time: 2703 | Elapsed: 0
---------------------------------------
Total Time: 51672

V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:12
- size : 3000000
- memory : 1536000000 (1500000K)
- repeat : 1
- Memory Load : 25%
- Allocating Data .... 16


* Starting threads
- Creating thread 0
- Creating thread 1
- Creating thread 2
- Creating thread 3
- Creating thread 4
- Creating thread 5
- Creating thread 6
- Creating thread 7

- Creating thread 8
- Creating thread 9
- Creating thread 10
- Creating thread 11
* Resuming threads
- Resuming thread# 0 in 41 msecs.
- Resuming thread# 1 in 467 msecs.
- Resuming thread# 2 in 334 msecs.
- Resuming thread# 3 in 500 msecs.
- Resuming thread# 4 in 169 msecs.
- Resuming thread# 5 in 724 msecs.
- Resuming thread# 6 in 478 msecs.
- Resuming thread# 7 in 358 msecs.
- Resuming thread# 8 in 962 msecs.
- Resuming thread# 9 in 464 msecs.
- Resuming thread# 10 in 705 msecs.
- Resuming thread# 11 in 145 msecs.


* Wait For Thread Completion

- Memory Load: 97%
* Done
---------------------------------------
0 | Time: 7984 | Elapsed: 0
1 | Time: 7515 | Elapsed: 0
2 | Time: 7188 | Elapsed: 0
3 | Time: 6672 | Elapsed: 0
4 | Time: 6500 | Elapsed: 0
5 | Time: 5781 | Elapsed: 0
6 | Time: 5250 | Elapsed: 0
7 | Time: 4953 | Elapsed: 0
8 | Time: 3953 | Elapsed: 0
9 | Time: 3484 | Elapsed: 0
10 | Time: 2750 | Elapsed: 0
11 | Time: 2547 | Elapsed: 0
---------------------------------------
Total Time: 64577


--
HLS

Woody

Mar 22, 2010, 3:49:53 AM
On Mar 21, 11:19 am, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:
> I have an application that uses enormous amounts of RAM in a
> very memory bandwidth intensive way.

Unfortunately, you cannot think of a single "memory bandwidth" as
being the limiting factor. The cache behavior is the biggest
determinant of speed, because reading/writing is much faster to cache
than to main memory (that's why cache is there). To truly optimize an
application, and to answer your question about more threads, you must
consider the low-level details of memory usage, such as, what is the
size of the cache lines? How is memory interleaved? What is the size
of the translation look-aside buffer? How is cache shared among the
cores? Is there one memory controller per processor (if you have
multiple processors), or per core?

There are tools such as AMD CodeAnalyst (free) or Intel VTune ($$$)
that measure these things. Once you know where the bottlenecks really
are, you can go to work rearranging your code to keep all the
computer's resources busy. You will need to run all the tests on your
actual app, or something close to it, for meaningful results.

BTW, the same details that determine memory speed also make the
comparison of CPU speed meaningless.

Peter Olcott

Mar 22, 2010, 7:17:58 AM

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:uLe1PnXy...@TK2MSFTNGP05.phx.gbl...

Actually, while I was sleeping and dreaming last night, I figured out
that your "proof" could be relevant. It would depend upon how well
your code emulated my actual memory access patterns.

In any case, I also figured out that emulating my actual memory access
patterns should not be that difficult, and then I could easily
validate your ideas against mine. I did this while asleep and
dreaming. This is the very first time in my life that I actually
"dreamed something up."


Peter Olcott

Mar 22, 2010, 7:28:36 AM

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:%23IrS0xY...@TK2MSFTNGP04.phx.gbl...

OK, and where is the summary conclusion?
Also, by using a memory-mapped file your process would have entirely
different behavior than mine.

I know that it is possible that you could have been right all along
about this, and I could be wrong. I know this because of a term that I
coined: [Ignorance Squared].

[Ignorance Squared] is the process by which a lack of understanding is
perceived, by the one who lacks it, as disagreement. Whereas the one
who has understanding knows that the ignorant person lacks
understanding, the ignorant person lacks this insight, and is thus
ignorant even of his own ignorance; hence the term [Ignorance
Squared].

Now that I have a way to empirically validate your theories against
mine (which I dreamed up last night while sleeping), I will do this.


Peter Olcott

Mar 22, 2010, 7:33:13 AM
It is very hard to reply to messages with quoting turned off; please
turn quoting on. Also, please tell me how quoting gets turned off.

When a process requires continual, essentially random access to data
that is very much larger than the largest cache, then I think that
memory bandwidth could be a limiting factor in performance.
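The effect being claimed here - random access over a working set far larger than cache costing much more than sequential access - is easy to measure rather than argue about. A hedged sketch (helper names are mine; with a large enough array the shuffled walk is typically several times slower than the sequential one):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Time summing `data` through an index vector. With a sequential
// index order the walk is prefetch-friendly; with a shuffled order
// nearly every access is a potential cache miss.
static double timed_indexed_sum(const std::vector<uint32_t>& data,
                                const std::vector<uint32_t>& idx,
                                uint64_t* sum_out) {
    auto t0 = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (uint32_t i : idx) sum += data[i];
    auto t1 = std::chrono::steady_clock::now();
    *sum_out = sum;  // returning the sum keeps the reads from being elided
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Build a sequential or shuffled index order over n elements.
static std::vector<uint32_t> make_indices(size_t n, bool shuffled) {
    std::vector<uint32_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0u);
    if (shuffled) {
        std::mt19937 rng(42);  // fixed seed: reproducible runs
        std::shuffle(idx.begin(), idx.end(), rng);
    }
    return idx;
}
```

Both walks touch the same elements, so the checksums match; only the access order, and therefore the cache behavior, differs.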

"Woody" <ols...@sbcglobal.net> wrote in message
news:7ff25c57-b2a7-4b31...@d37g2000yqn.googlegroups.com...

Peter Olcott

Mar 22, 2010, 8:01:56 AM

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:uLe1PnXy...@TK2MSFTNGP05.phx.gbl...

A group with a more specialized focus is coming to the same
conclusions that I have derived.


Joseph M. Newcomer

Mar 22, 2010, 10:31:05 AM
See below...

***
Fred can dig a ditch 10 feet long in 1 hour. Charlie can dig a ditch
10 feet long in 20 minutes. Therefore, Charlie is faster than Fred by
a factor of 3.

How long does it take Fred and Charlie working together to dig a ditch
10 feet long? (Hint: any mathematical answer you come up with is
wrong, because Fred and Charlie (a) hate each other, and so Charlie
tosses his dirt into the place Fred has to dig, or (b) are good
buddies and stop for a beer halfway through the digging, or (c)
Charlie tells Fred he can do it faster by himself, and Fred just sits
there while Charlie does all the work and finishes in 20 minutes,
after which they go out for a beer. Fred buys.)

You have made an obvious failure here in thinking that if one thread
takes 1/k the time and the only difference is memory bandwidth, then
two threads are necessarily LINEAR. Duh! IT IS NOT THE SAME WHEN
CACHES ARE INVOLVED! YOU HAVE NO DATA! You are jumping to an
unwarranted conclusion based on what I can at best tell is a
coincidence. And even if it were true, caches give nonlinear effects,
so you are not even making sense when you make these assertions! You
have proven the case for one value of N, but you have immediately
assumed that if you prove the case for N, you have proven it for case
N+1, which is NOT how inductive proofs work! Since you were so hung up
on geometric proofs, can you explain how, when doing an inductive
proof, proving the case for 1 element tells you what the result is for
N+1 for arbitrary N? Hell, it doesn't even tell you the result for
N=1, but you have immediately assumed that it is a valid proof for all
values of N!

YOU HAVE NO DATA! You are making a flawed assumption of linearity that
has no basis! Going back to your fixation on proof: in a nonlinear
system without a closed-form analytic solution, demonstrate to me that
your only possible solution is based on a linear assumption. You are
ignoring all forms of reality here. You are asserting without basis
that the system is linear (it is known that systems with caches are
nonlinear in memory performance). So you are contradicting known
reality without any evidence to support your "axiom". It ain't an
axiom, it's a wild-assed guess.

Until you can demonstrate with actual measured performance that your
system has COMPLETELY linear behavior in an L3 cache system, there is
no reason to listen to any of this nonsense you keep espousing as if
it were "fact". You have ONE fact, and that is not enough to raise
your hypothesis to the level of "axiom".

All you have proven is that a single thread is limited by memory bandwidth. You have no
reason to infer that two threads will not BOTH run faster because of the L3 cache effects.
And you have ignored L1/L2 cache effects. You have a trivial example from which NOTHING
can be inferred about multithreaded performance. You have consistently confused
multiprocess programming with multithreading and arrived at erroneous conclusions based on
flawed experiments.

Note also that if you use a memory-mapped file and two processes share the same mapping
object, there is only one copy of the data in memory! This has not previously come up in
our discussions, but could be critical to the performance of your multiple processes.
joe
****

Peter Olcott

unread,
Mar 22, 2010, 11:02:33 AM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:ioueq5hdsf5ut5pha...@4ax.com...

(1) People in a more specialized group are coming to the

same conclusions that I have derived.

(2) When a process requires essentially random (mostly
unpredictable) access to far more memory than can possibly
fit into the largest cache, then actual memory access time
becomes a much more significant factor in determining actual
response time.

Hector Santos

unread,
Mar 22, 2010, 11:14:27 AM3/22/10
to
Peter Olcott wrote:


> A group with a more specialized focus is coming to the same
> conclusions that I have derived.

Oh Peter, you're fibbing! The simulator I provided is a classic
example of an expert on the subject in action. If you wanted to learn
anything here, you should study it.

The process handler emulates your MEMORY ACCESS claims to the fullest
extent, with the minimum of OP CODES for any other work. Any engineer (and by
the way, I am a trained Chemical Engineer) with process-control and
simulation experience can easily see that the work I showed as proof
invalidates your understanding and shows how multi-threads with
shared memory are superior to your single-main-thread process idea.

If you can't see that in the code, then quite honestly, you don't know
how to program or understand the concept of programming.

--
HLS

Hector Santos

unread,
Mar 22, 2010, 11:16:58 AM3/22/10
to
Joseph M. Newcomer wrote:


> Note also if you use a memory-mapped file and two processes share the same mapping object
> there is only one copy of the data in memory! This has not previously come up in
> discussions, but could be critical to your performance of multiple processes.
> joe


He has been told that MMF can help him.

--
HLS

Peter Olcott

unread,
Mar 22, 2010, 11:28:24 AM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:%23A9xbJd...@TK2MSFTNGP02.phx.gbl...

I am telling you the truth; I am almost compulsive about
telling the truth.
When the conclusions are final I will post a link here.


Hector Santos

unread,
Mar 22, 2010, 11:26:37 AM3/22/10
to
Peter Olcott wrote:


> (1) People in a more specialized group are coming to the
> same conclusions that I have derived.


You're lying. Stop the LYING! The code I posted proves otherwise, and if your
group is real, show us.


> (2) When a process requires essentially random (mostly
> unpredictable) access to far more memory than can possibly
> fit into the largest cache, then actual memory access time
> becomes a much more significant factor in determining actual
> response time.


But you are assuming an ACCESSOR has exclusive access and is running
uninterrupted 100% of the time, which is not the case here. In other words,
while one thread is asleep, the other thread has full access to local or
remote memory. The factor of "speed" can be factored out since it is
constant across the board.

The code I posted PROVES it!

--
HLS

Peter Olcott

unread,
Mar 22, 2010, 11:31:17 AM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:%23Q4$1KdyK...@TK2MSFTNGP02.phx.gbl...

Since my process (currently) requires unpredictable access
to far more memory than can fit into the largest cache, I
see no possible way that adding 1000-fold slower disk access
could possibly speed things up. This seems absurd to me.


Hector Santos

unread,
Mar 22, 2010, 12:06:34 PM3/22/10
to
Peter Olcott wrote:

> Since my process (currently) requires unpredictable access
> to far more memory than can fit into the largest cache, I
> see no possible way that adding 1000-fold slower disk access
> could possibly speed things up. This seems absurd to me.


And I would agree that it would seem absurd to inexperienced people.

But you need to TRUST the power of your multi-processor computer
because YOU are most definitely under utilizing it by a long shot.

The code I posted is the proof!

Your issue is akin to having a pickup truck, overloading the back,
piling things on each other, overweight beyond the recommended safety
levels per the specifications of the truck manufacturer (and city/state
ordinances); now your driving, speed, and vision of your truck are all
altered. Your truck won't go as fast now, and even if it could,
things can fall, people can die, crashes can happen.

You have two choices:

- You can stop and unload stuff and come back and pick it up on a
2nd trip; your total travel time is doubled.

- You can get a 2nd pickup truck, split the load, and get
on a four-lane highway and drive side by side; sometimes
one creeps ahead, and sometimes the other moves ahead, and both reach
the destination at nearly the same expected time.

Same thing!

You are overloading your machine to the point where it is working very, very
hard to satisfy your single-thread process needs. You may "believe"
it is working at optimal speed because it has uninterrupted exclusive
access, but that is not reality. You are under-utilizing the power of
your machine.

Whether you realize it or not, the overloaded pickup truck is smart
and is stopping every X milliseconds to check whether you have a 2nd
pickup truck to offload some work and do some moving for you!!

You need to change your thinking.

However, at this point, I don't think you have any coding skills,
because if you did, you would be EAGERLY JUMPING at the code I
provided to see for yourself.

--
HLS

Hector Santos

unread,
Mar 22, 2010, 12:20:22 PM3/22/10
to
Peter Olcott wrote:


>> If you can't see that in the code, then quite honestly, you
>> don't know how to program or understand the concept of
>> programming.

> I am telling you the truth, I am almost compulsive about
> telling the truth. When the conclusions are final I will post a link here.


What GROUP is this? No one will trust your SUMMARY unless you cite
the group. Until you do, you're lying and making things up.

I repeat: if you can't see that the code I posted proves your thinking is
incorrect, you don't know what you are talking about, and it's becoming
obvious now that you don't have any kind of programming or even engineering
skills.

--
HLS

Peter Olcott

unread,
Mar 22, 2010, 1:15:27 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:uGVpjmdy...@TK2MSFTNGP06.phx.gbl...

> Peter Olcott wrote:
>
>> Since my process (currently) requires unpredictable
>> access to far more memory than can fit into the largest
>> cache, I see no possible way that adding 1000-fold slower
>> disk access could possibly speed things up. This seems
>> absurd to me.
>
>
> And I would agree that it would seem absurd to
> inexperienced people.
>
> But you need to TRUST the power of your multi-processor
> computer because YOU are most definitely under utilizing
> it by a long shot.
>
> The code I posted is the proof!

If it requires essentially nothing besides random access to
entirely different places in 100 MB of memory, then (and only
then) would it be reasonably representative of my
process. Nearly all my process does is look up in memory
the next place to look up in memory.

Peter Olcott

unread,
Mar 22, 2010, 1:21:52 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:%23o9WRud...@TK2MSFTNGP06.phx.gbl...

I did not examine the code because I did not want to spend
time looking at something that is not representative of my
process. Look at the criteria in my other post, and if you
agree that it meets those criteria, then I will look at your
code.

You keep bringing up memory-mapped files. Although this may
very well be a very good way to use disk as RAM, or to load
RAM from disk, I do not see any possible reasoning that
could ever show that a hybrid combination of disk
and RAM could exceed the speed of pure RAM alone.

If you can, then please show me the reasoning that supports
this. Reasoning is the ONLY source of truth that I trust;
all other sources of truth are subject to errors. Reasoning
is also subject to errors, but these errors can be readily
discerned as breaking one or more of the rules of correct
reasoning.


Hector Santos

unread,
Mar 22, 2010, 1:27:38 PM3/22/10
to
On Mar 22, 11:02 am, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

> (2) When a process requires essentially random (mostly
> unpredictable) access to far more memory than can possibly
> fit into the largest cache, then actual memory access time
> becomes a much more significant factor in determining actual
> response time.

As a follow up, in the simulator ProcessData() function:

void ProcessData()
{
    KIND num;
    for (DWORD r = 0; r < nRepeat; r++) {
        Sleep(1);
        for (DWORD i = 0; i < size; i++) {
            //num = data[i];    // plain array
            num = fmdata[i];    // file-mapping array view
        }
    }
}

This is serialized access to the data; it's not random. When you have
multiple threads, you approach an empirical boundary condition where
multiple accessors are requesting the same memory. So on one hand,
the Peter viewpoint, you have contention issues and hence slowdowns. On
the other hand, you have a CACHING effect, where the reading done
by one thread benefits all the others.

Now, we can alter this ProcessData() by adding a random access logic:

void ProcessData()
{
    KIND num;
    for (DWORD r = 0; r < nRepeat; r++) {
        Sleep(1);
        for (DWORD i = 0; i < size; i++) {
            // Note: rand() only spans 0..RAND_MAX (32767 in MSVC), so for
            // size > RAND_MAX this samples only the first 32K records.
            DWORD j = (rand() % size);
            //num = data[j];    // plain array
            num = fmdata[j];    // file-mapping array view
        }
    }
}

One would expect higher pressure to move virtual memory into the
process working set in random fashion. But in reality, that
randomness may not be as pressuring as you expect.

Lets test this randomness.

First, a test with serialized access, with two threads using a 1.5GB file
map.

V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2


- size : 3000000
- memory : 1536000000 (1500000K)

- repeat : 2
- Memory Load : 22%


- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads

- Resuming thread# 0 in 743 msecs.
- Resuming thread# 1 in 868 msecs.


* Wait For Thread Completion

- Memory Load: 95%
* Done
---------------------------------------
0 | Time: 5734 | Elapsed: 0
1 | Time: 4906 | Elapsed: 0
---------------------------------------
Total Time: 10640

Notice the MEMORY LOAD climbed to 95%; that's because the entire
spectrum of the data was read in.

Now let's try unpredictable random access. I added a /j switch to
enable the random indexing.

V:\wc5beta>testpeter3t /r:2 /s:3000000 /t:2 /j


- size : 3000000
- memory : 1536000000 (1500000K)

- repeat : 2
- Memory Load : 22%


- Allocating Data .... 0
* Starting threads
- Creating thread 0
- Creating thread 1
* Resuming threads

- Resuming thread# 0 in 116 msecs.
- Resuming thread# 1 in 522 msecs.


* Wait For Thread Completion

- Memory Load: 23%
* Done
---------------------------------------
0 | Time: 4250 | Elapsed: 0
1 | Time: 4078 | Elapsed: 0
---------------------------------------
Total Time: 8328

BEHOLD, it is even faster because of the randomness. The memory
load didn't climb because it didn't need to fault the
entire 1.5GB into the process working set.

So once again, your engineering philosophy (and lack thereof) is
completely off base. You are under-utilizing the power of your
machine.

--
HLS

Hector Santos

unread,
Mar 22, 2010, 1:38:39 PM3/22/10
to
Peter Olcott wrote:

>> The code I posted is the proof!
>
> If it requires essentially nothing besides random access to
> entirely different places of 100 MB of memory, thenn (then
> and only then) would it be reasonably representative of my
> process. Nearly all the my process does is look up in memory
> the next place to look up in memory.

Good point, and moments ago I proved how random access is even BETTER!
I posted the work via Google Groups and have not seen it yet here
on Microsoft's server. Let's wait 10 minutes or so before I post it
again here... Here it is:

completely off base. You are under utilizing the power of your
machine.

I take pride in my extremely rich real-world modeling experience, a
good bit of it in simulating advanced systems like nuclear power
plants, wind turbines, steam generators, process control systems, fault
tolerance, failure analysis, prediction systems, AI and expert
systems, etc., etc. I am very, very GOOD at this, and I carry
that knowledge into the design and QA engineering of my product lines.

Your application is PRIMITIVE compared to the real-world
applications out there. Case in point: just for our FTP server alone,
we have customers that are uploading HUGE files, including gigabytes.
That has to be processed by many parts of the entire framework in a
scaled manner, even if it's just one request that happens to come in.

--
HLS

Mikel

unread,
Mar 22, 2010, 1:44:18 PM3/22/10
to
On 22 mar, 18:21, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:
> "Hector Santos" <sant9...@nospam.gmail.com> wrote in message
> reasoning.

Hi,
First of all, sorry for jumping in. I'm no expert in this area, so I
can't help much with it.
However, I would like to make a couple of comments.
First, I think that if you ask a question in a forum, newsgroup,
or whatever, you should be ready to accept the answer, even if you
don't like it or it goes against what you thought. If you are not going
to accept, or at least take into account, the opinion of people who
probably know better than you, don't ask. If you *know* that your
solution is the best, don't ask whether it is; but if you ask, consider
the option that what other people say is true (or truer than what you
think).
Second, regarding "Reasoning is the ONLY source of truth that I
trust": that's what the ancient Greeks did, and they reached truths like
"heavier objects fall faster than lighter ones". The scientific method
has proved to be a better way to reach the truth in science. Set a
hypothesis, test it, and reach a conclusion. If one example breaks the
hypothesis, it was wrong. Stop. Change it. But you have to test it.
Besides, trying to even imagine what goes on in a multicore system,
with its caches and all, the OS, etc., and trying to control it, seems
like a task for someone who probably should not be asking the
questions, but answering them.
Just an opinion...

Hector Santos

unread,
Mar 22, 2010, 1:56:26 PM3/22/10
to
Hector Santos wrote:

> Peter Olcott wrote:


> One would suspect higher pressures to move virtual memory into the
> process working set in random fashion. But in reality, that randomness
> may not be as over pressuring as you expect.
>

> ...
>

> BEHOLD, it is even faster because of the randomness. The memory
> loading didn't climb because it didn't need to virtually load the entire
> 1.5GB into the process working set.


Peter, when you talk about an unpredictable scenario: in the simulation
world, there are techniques to give you low and high boundary
conditions and also to predict the most likely scenario.

Take the time to read about concepts such as Pareto's Principle and how
it is used in real-world practice, including high-end complex simulations.

I say Pareto here because for YOU, for your OCR application, it applies
VERY nicely.

Side note: the RISC computer is based on Pareto's principle; it
optimizes for the MOST used instructions, which are executed 80% of the
time - that is why it is faster. As the story goes, myth or not, the
inventor of the RISC chip used the disorganized mess on his desk to get
that flash of genius. He noticed that his most important papers and
memos were migrating to the top of the PILE and the least used ones
went to the bottom, hardly ever to be used or seen again. He recognized
the near-MAGIC Pareto ratio of 80:20 that is so apparent in so many
things in real life. So he invented the Reduced Instruction Set
Computer - RISC!

For your OCR, I take it there is a character set. One optimization
would be to use Pareto's principle here to make sure that the 80%
most used characters are "more" accessible or "faster" than the rest.

There are all kinds of ideas here. That is why Geoff, Joe, I, and
others who provided input have said you cannot use your primitive
single-thread process modeling as a BEST CASE scenario - you can't,
because it would be incorrect to believe it would be when you didn't
even try to leverage anything the computer and hardware offer.


--
HLS

Joseph M. Newcomer

unread,
Mar 22, 2010, 2:22:30 PM3/22/10
to
See below...

****
How? I have no idea how to predict L3 cache performance on an i7 system, and I don't
believe they do, either. No theoretical model exists that is going to predict actual
behavior short of a detailed simulation, and I talked to Intel and they are not releasing
performance statistics, period, so there is no way short of running the experiment to
obtain a meaningful result.
****


>
>(2) When a process requires essentially random (mostly
>unpredictable) access to far more memory than can possibly
>fit into the largest cache, then actual memory access time
>becomes a much more significant factor in determining actual
>response time.

****
What is your cache collision ratio, actually? Do you really understand the L3 cache
replacement algorithm? (I can't find out anything about it on the Intel site! So I'm
surprised you have this information, which Intel considers Corporate Confidential)
****

>> Joseph M. Newcomer [MVP]

Hector Santos

unread,
Mar 22, 2010, 2:25:36 PM3/22/10
to
Hector Santos wrote:

> I say Pareto here because for YOU, your OCR application, it applies VERY
> nicely.
>

> ..
>

> For your OCR, I take it there is a character set. One optimization
> would be to use pareto's principle here to make sure that 80% of the

> most used characters are "more" accessible or "faster" than the rest.

To simulate this, you would basically have two random calculations in
ProcessData():

1) Random selection to decide to use 80% group or 20% group
2) Random selection to decide which index to use within the group.

Of course, you have to decide what characters are the most used.

In the simulator (version 3 with the file mapping version), I used one
file mapping class that allows me to quickly do something like this:

typedef struct tagTData {
    char reserve[512];
} TData;


TFileMap<TData> fmdata("d:\\largefilemap.dat", 3000000);

which will open/create an ISAM file with TData records and file-map the
file handle; 3000000 * 512 bytes equals 1.5 GB.

The class allows me to access the records by index using operators, so
I can do something like this:

TData rec = fmdata[i];

I can change the TData to this:

typedef struct tagTData {
    BOOL mostused;       // set if it is a most-used item
    char reserve[508];
} TData;

or I can use two maps, one most used and the other least used:

TFileMap<TData> mfmdata("d:\\largefilemap-most.dat", 3000000 *.80);
TFileMap<TData> lfmdata("d:\\largefilemap-least.dat", 3000000 *.20);

Either way, you can simulate the idea of most vs. least used to get a
very close emulation of your system and a prediction of what to expect.

--
HLS

Joseph M. Newcomer

unread,
Mar 22, 2010, 2:32:02 PM3/22/10
to
See below...

****
He has NO CLUE as to what a "memory-mapped file" actually is. This last comment indicates
total and complete cluelessness, plus a startling inability to understand that we are
making USEFUL suggestions because WE KNOW what is going on and he has no idea.

Like you, I'm giving up. There is only so long you can beat someone over the head with
good ideas which they reject because they have no idea what you are talking about, but
won't expend any energy to learn about, or ask questions about. Since he doesn't
understand what shared sections are, or what they buy, and that an MMF is the way to get
shared sections, I'm dropping out of this discussion. He has found a set of "experts" who
agree with him (your example apparently doesn't convey the problem correctly), thinks
memory-mapped files limit access to disk speed (not even understanding they are FASTER
than ReadFile!), and has failed utterly to understand even the most basic concepts of an
operating system. He thinks it is like an automatic transmission, which you can use
without knowing or caring how it works, when what he is really doing is trying to
build a competition racing machine and saying "all that stuff about the engine is
irrelevant", whereas anyone who does competition racing (like my next-door neighbor did
for years) knows why all this stuff is critical. If he were a racer, and we told him
about power-shifting (shifting a manual transmission without involving the clutch), he'd
tell us he didn't need to understand that.

Sad, really.
joe

Hector Santos

unread,
Mar 22, 2010, 2:30:46 PM3/22/10
to
Joseph M. Newcomer wrote:

>> (1) People in a more specialized group are coming to the
>> same conclusions that I have derived.
> ****
> How? I have no idea how to predice L3 cache performance on an i7 system, and I don't
> believe they do, either. No theoretical model exists that is going to predict actual
> behavior, short of a detailed simulation,and I talked to Intel and they are not releasing
> performance statistics, period, so there is no way short of running the experiement to
> obtain a meaningful result.
> ****


Have you seen the posted C/C++ simulator and proof that shows how
using multiple threads and shared data trumps his single main thread
process theory?


>> (2) When a process requires essentially random (mostly
>> unpredictable) access to far more memory than can possibly
>> fit into the largest cache, then actual memory access time
>> becomes a much more significant factor in determining actual
>> response time.
> ****
> What is your cache collision ratio, actually? Do you really understand the L3 cache
> replacement algorithm? (I can't find out anything about it on the Intel site! So I'm
> surprised you have this information, which Intel considers Corporate Confidential)
> ****


Well, the thing is, Joe, that this chip cache is something he will
be using. This application will use the cache the OS maintains.

He is thinking about stuff that he shouldn't be worrying about. He
thinks his CODE deals directly with the chip caches.

--
HLS

Hector Santos

unread,
Mar 22, 2010, 2:57:13 PM3/22/10
to
Joseph M. Newcomer wrote:

>>>
>>> He has been told that MMF can help him.
>>>
>>> --
>>> HLS
>> Since my process (currently) requires unpredictable access
>> to far more memory than can fit into the largest cache, I
>> see no possible way that adding 1000-fold slower disk access
>> could possibly speed things up. This seems absurd to me.
> ****
> He has NO CLUE as to what a "memory-mapped file" actually is. This last comment indicates
> total and complete cluelessness, plus a startling inabilitgy to understand that we are
> making USEFUL suggestions because WE KNOW what is going on and he has no idea.


What he doesn't realize is that his 4GB load is already
virtualized. He believes that all of it is in pure RAM. The page
faults prove that point, but he doesn't understand what that means.

He doesn't realize that his PC is technically a VIRTUAL MACHINE! He
doesn't understand the INTEL memory segmentation framework. Maybe he
thinks it's DOS? That is why I said if he wants PURE RAM operations, he
might be better off with a 16-bit DPMI DOS program or moving over to a
MOTOROLA chip that will offer a linear memory model - if that is
still true today.

> Like you, I'm giving up.


There are two parts:

First, I'm actually exploring scaling methods with the simulator I
wrote for him. I have a version where I am exploring NUMA that will
leverage 2003+ Windows technology. I am going to pencil in getting a
test computer with an Intel XEON that offers NUMA.

Second, I'd like to get some goodwill out of this if I can convince this
guy that he needs to change his application to perform better, or at
least to understand that his old memory-usage paradigm for processes
does not apply under Windows. The only reason I can suspect for his
ignorance is that he is not a programmer, or at the very least has a
very primitive programming knowledge. A real Windows programmer would
understand these basic principles or at least explore what experts are
saying. He is not even exploring anything!

> I'm dropping out of this discussion.


I should too.

--
HLS

Hector Santos

unread,
Mar 22, 2010, 2:58:55 PM3/22/10
to
Hector Santos wrote:

>
> Well, the thing is joe, is that this chip cache is something he will
> using.


I meant "is NOT something..."


--
HLS

Peter Olcott

unread,
Mar 22, 2010, 3:31:09 PM3/22/10
to
Perhaps you did not understand what I said. The essential
process inherently requires unpredictable access to memory,
such that spatial or temporal locality of reference
rarely occurs in the cache.

"Hector Santos" <sant...@gmail.com> wrote in message
news:e2aedb82-c9ad-44b3...@c16g2000yqd.googlegroups.com...

Hector Santos

unread,
Mar 22, 2010, 3:39:43 PM3/22/10
to
Yet again, the PROOF is not enough for you.

What you don't understand is that YOU, YOUR APPLICATION will never
deal with chip caching.

Your application deals with working sets and VIRTUAL MEMORY.

You proved that when you indicated the PAGE FAULTS - that's VIRTUAL
MEMORY OPERATIONS.

It has nothing to do with the L1, L2, L3 CHIP CACHING.

--
HLS

Hector Santos

unread,
Mar 22, 2010, 3:37:01 PM3/22/10
to
Peter Olcott wrote:

> You keep bringing up memory mapped files. Although this may
> very well be a very good way to use disk as RAM, or to load
> RAM from disk, I do not see any possible reasoning that
> could ever possibly show that a hybrid combination of disk
> and RAM could ever exceed the speed of pure RAM alone.


The reason is that you are not using pure RAM for the entire load of
data. Windows will virtualize everything. In basic terms, there are
two kinds of memory maps: the ones Windows uses internally (that is how
you get system pages) and the ones that applications create themselves.

What you think you are getting is "uninterrupted work", but it is
interrupted - that's called a preemptive operating system, so your
application is never always in an active running state; you
perceive that it is, but it is not.

Think of it as video picture frames. You perceive an uninterrupted
live animation or motion - the reality is that these are picture
snapshots (frames) displayed very rapidly, and there are time gaps
between frames! For a PC, it's called context switching, and these
gaps allow other things to run.

The same with MEMORY - it is virtualized, even if you have 8GB!

Unless you tell Windows:

Please do not CACHE this memory

then it is CACHED MEMORY.

So you have to give an EXPLICIT instruction in your CODE to tell
Windows not to CACHE your memory.

Your application, because you don't know how to do this, is using
BUFFERED I/O and CACHED, VIRTUALIZED MEMORY - by default.

> Reasoning is the ONLY source of truth that I trust,
> all other sources of truth are subject to errors.


Reasoning comes first through understanding the technology. If you
don't understand it, then you have no right to judge experts, or to
presume any conclusions that ignorantly contradict the realities -
realities understood by experts and by those who understand the
technology at a very practical level.

--
HLS

Peter Olcott

unread,
Mar 22, 2010, 3:40:57 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:j0dfq5d859v5mbgtj...@4ax.com...

Try and explain exactly how cache can possibly help when
there is most often essentially no spatial or temporal
locality of reference.

Peter Olcott

unread,
Mar 22, 2010, 3:43:09 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:ecdfq5lb57qrou47d...@4ax.com...

This is very likely true. Let's just drop this one until
someone explains all of the little nuances of exactly how
cache can greatly improve performance in the case where
there is essentially no spatial or temporal locality of
reference.

> total and complete cluelessness, plus a startling

Peter Olcott

unread,
Mar 22, 2010, 3:45:59 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:uqklKcfy...@TK2MSFTNGP04.phx.gbl...

> Peter Olcott wrote:
>
>> You keep bringing up memory mapped files. Although this
>> may very well be a very good way to use disk as RAM, or
>> to load RAM from disk, I do not see any possible
>> reasoning that could every possibly show that a hybrid
>> combination of disk and RAM could ever exceed the speed
>> of pure RAM alone.
>
>
> The reason is that you are not using pure RAM for the
> entire load of data. Windows will virtualize everything.
> In basic terms, there are two kinds - Windows using memory
> maps internally, that is how you get system pages and ones
> that applications create themselves.

It loads my data, and then the process monitor tells me that
there are no page faults even when the process is invoked 12
hours later.

Hector Santos

unread,
Mar 22, 2010, 3:45:21 PM3/22/10
to
Peter Olcott wrote:

> Try and explain exactly how cache can possibly help when
> there is most often essentially no spatial or temporal
> locality of reference.


It's called WINDOWS virtual memory caching technology.

This is not DOS. You are not dealing directly with the CHIP here.

You need to stop pulling things out of your reading, finding a new
"buzz word", thinking you got an "AH HA", and believing it proves your
erroneous understanding of Windows programming.

--
HLS

Peter Olcott

unread,
Mar 22, 2010, 3:56:23 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:ehG4qdfy...@TK2MSFTNGP04.phx.gbl...

> Yet again, the PROOF is not enough for you.
>
> What you don't understand is that YOU, YOUR APPLICATION
> will never deal with chip caching.
>
> Your application deals with working sets and VIRTUAL
> MEMORY.
>
> You proved that when you indicated the PAGE FAULTS - thats
> VIRTUAL MEMORY OPERATIONS.
>
> It has nothing to do with the L1, L2, L3 CHIP CACHING.

No, not quite nothing; both chip caching and page faults have
to do with memory access.

L1, L2, and L3 chip caching are also very dependent upon spatial
and/or temporal locality of reference. Maybe you know a way
that chip caching can work without either temporal or
spatial locality of reference; I do not.

As far as I can tell, the only general approach to
chip caching that could possibly work without depending
upon locality of reference would be for the chip to somehow
comprehend enough of the underlying algorithm to predict
memory access patterns.

Joseph M. Newcomer

unread,
Mar 22, 2010, 3:57:43 PM3/22/10
to

On Mon, 22 Mar 2010 11:14:27 -0400, Hector Santos <sant...@nospam.gmail.com> wrote:

>Peter Olcott wrote:
>
>
>> A group with a more specialized focus is coming to the same

>> conclusions that I have derived.
>

>Oh Peter, you're fibbing! The simulator I provided is a classic
>example of an expert on the subject in action. If you wanted to learn
>anything here, you should study it.

****
Actually, I believe he is telling the truth. He has fallen in with a group that is largely
clueless also, but they look like experts because they are agreeing with him.
****


>
>The process handler emulates your MEMORY ACCESS claims to the fullest
>extent with minimum OP CODES of any other work. Any engineer (and by
>the way, I am trained Chemical Engineer) with process control and
>simulation experience can easily see the work I showed as proof in
>invalidating your understanding and shows how multi-threads with
>shared memory is superior to your single main thread process idea.

***
In science, it takes only ONE counterexample to sink any theory. You have the
counterexample, Peter has the theory. Q.E.D.
****
>
>If you can't see that in the code, then quite honestly, you don't know

>how to program or understand the concept of programming.

Joseph M. Newcomer

unread,
Mar 22, 2010, 4:11:05 PM3/22/10
to
See below...

On Mon, 22 Mar 2010 14:30:46 -0400, Hector Santos <sant...@nospam.gmail.com> wrote:

>Joseph M. Newcomer wrote:
>
>>> (1) People in a more specialized group are coming to the
>>> same conclusions that I have derived.
>> ****
>> How?  I have no idea how to predict L3 cache performance on an i7 system, and I don't
>> believe they do, either.  No theoretical model exists that is going to predict actual
>> behavior, short of a detailed simulation, and I talked to Intel and they are not releasing
>> performance statistics, period, so there is no way short of running the experiment to
>> obtain a meaningful result.
>> ****
>
>
>Have you seen the posted C/C++ simulator and proof that shows how
>using multiple threads and shared data trumps his single main thread
>process theory?

***
Yes. Note that I mentioned that your counterexample trumps his theory. His theory is so
full of holes it is hard to imagine why he is clinging to it with such ferocity, given we
keep telling him he is wrong.

And the CORRECT approach, if he believed that your code doesn't represent his problem
domain, would be to read it, modify it to fit his model, and run it.  But that would
potentially expose his theory to absolute destruction, or give him useful data by which he
could determine what is going to happen, and neither of those seems to be his priority. His
failure to understand or even look into Memory Mapped Files, but instead come up with some
off-the-wall idea of how they behave which, unfortunately, is not at all like they
ACTUALLY behave, or realize that using them with a shared mapping object would reduce the
memory footprint of multiple processes, is indicative of a completely closed mind.  We
are really wasting our time here; he doesn't want to get answers, just to tell us why we are
wrong.  And he ignores the fact that we have been doing multithreading decades longer than
he has.  I've been doing it since 1975.  Or 1968, depending on what criteria you apply.
And when I point out obvious aspects he has ignored, such as cache influence, he tells
me he doesn't need to know this, because he wants to think of the OS as a "black box"
that works according to his imagined ideals, not how actual operating systems work.
****


>
>
>>> (2) When a process requires essentially random (mostly
>>> unpredictable) access to far more memory than can possibly
>>> fit into the largest cache, then actual memory access time
>>> becomes a much more significant factor in determining actual
>>> response time.
>> ****
>> What is your cache collision ratio, actually? Do you really understand the L3 cache
>> replacement algorithm? (I can't find out anything about it on the Intel site! So I'm
>> surprised you have this information, which Intel considers Corporate Confidential)
>> ****
>
>
>Well, the thing is, Joe, that this chip cache is not something he will
>be programming against. His application will be using the cache the OS maintains.
>

>He is thinking about stuff that he shouldn't be worrying about. He
>thinks his CODE deals directly with the chip caches.

****
A fact I keep beating him over the head with, but he chooses to ignore reality and
experience over erroneous experiments that have provided no useful information.  Note
that all his ranting is based on ONE experiment between incomparable systems that
measures only ONE thread, and fails to take into account nonlinearities of caching.  And
he refuses to listen to alternative suggestions because he misunderstands the technology
and doesn't appreciate what is really going on inside the OS or the hardware.

But that's pretty obvious, which is why I've given up making suggestions; he simply won't
listen to anyone except this hypothetical group of experts who must be right because they
agree with him.
joe

Peter Olcott

unread,
Mar 22, 2010, 4:19:06 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:nqifq5doopkg19p39...@4ax.com...

That would be true. Try to explain how any current caching
system could provide any significant benefit with essentially
no spatial or temporal locality of reference. If this cannot
be done, then this single point may make all of the
other lines of reasoning moot.

Peter Olcott

unread,
Mar 22, 2010, 4:25:59 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:ea3d0gfy...@TK2MSFTNGP02.phx.gbl...

> Peter Olcott wrote:
>
>> Try and explain exactly how cache can possibly help when
>> there is most often essentially no spatial or temporal
>> locality of reference.
>
>
> Its called WINDOWS Virtual Memory Caching technology.
>
> This is not DOS. You are not dealing directly with the
> CHIP here.

I know that. I also know the inherent memory access patterns
of my algorithm.

Joe keeps bringing up how complex the actual underlying
memory access patterns are when one also considers cache.

I keep bringing up that there can be no complex underlying
memory access patterns if, because of the lack of spatial and
temporal locality of reference, the cache mostly cannot be used.

I am beginning to think you two guys are stuck in "refute
mode", yet I remain open to the possibility that it may be
me and neither of you.

Hector Santos

unread,
Mar 22, 2010, 4:23:25 PM3/22/10
to

Peter Olcott wrote:

> "Hector Santos"


>> It has nothing to do with the L1, L2, L3 CHIP CACHING.

> As far as I can tell the only possible general approach to
> chip caching that could possibly work that does not depend
> upon locality of reference would be for the chip to somehow
> comprehend enough of the underlying algorithm to predict
> memory access patterns.

Here's the thing:

Why are you worrying about this when you don't even know how to
program for it at any level?

You are at at the USER LEVEL, not KERNEL LEVEL or FILE DRIVER LEVEL!!

Do you really think your application needs Advanced Memory Chip
technology in order to work?

Do you really think this all works in slow motion?

Your application is extremely primitive - it really is. You have
too much belief that your application is BEYOND the needs of any other
application with data-loading needs.

You have no engineering sense whatsoever. Even if you say that your
response time is 100ms - SO WHAT if it's 1100 ms with multi-threads?

If your response time is 100ms, worrying about CHIP LEVEL stuff is crazy.


--
HLS

Joseph M. Newcomer

unread,
Mar 22, 2010, 4:28:33 PM3/22/10
to
See below...

On Mon, 22 Mar 2010 14:40:57 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:


>
>Try and explain exactly how cache can possibly help when
>there is most often essentially no spatial or temporal
>locality of reference.
>
****

While caches work well with locality of reference, that is just a heuristic for predicting
cache effects.  Locality of reference is not the point; maximizing cache hits is the
point.  And this can happen, particularly on a shared L3 cache, based solely on the cache
replacement algorithm.  We use locality of reference as the "easy" approach to determining
the likelihood of cache hits, because it is easy to analyze in applications that process
regular data like matrices and arrays. But it is not the theoretical optimum approach. If
you took the time to understand how caches work this would be obvious to you.

Try to explain why you believe this when you have run no experiments that have any
meaning.  The difference is that I am saying YOU HAVE NO DATA, and you are saying I KNOW
WHAT IS GOING TO HAPPEN, I DON'T NEED NO STINKIN' FACTS.  I don't believe you really know
what is going to happen, you are just guessing.  I know what I would do: as an engineer
(there's that nasty word again) I'd go out and GET the facts.  Then, I could say "But I
have run this experiment, and it substantiates my theory" and that would be useful
knowledge.  But you just blindly claim you "know" what is going to happen.  I'm supposed
to take this seriously from someone who doesn't even understand why a Memory Mapped File
is going to give superior performance?  Given your demonstrated lack of understanding of
operating systems, why should I believe ANY assertion you make, unless you have the data
to back it up?  Hell, I wouldn't believe MY OWN theories about performance without data,
and 15 years of performance measurement have convinced me of one absolute fact: "Ask a
programmer where the performance bottleneck is in their code, and you will get a wrong
answer".  That rule NEVER failed me in 15 years of real performance measurement of real
programs on real machines, and I believe it today.  Bottom line: only actual performance
data proves anything.  Theories about where performance is going are universally wrong
unless supported by actual measurements.  You have a p-baked theory, for p considerably
less than 0.5 (p==0.5 is half-baked), and you refuse to test your theory.  Not a
robust approach to building systems.  You may be absolutely correct, but you cannot PROVE
it without data.

Joseph M. Newcomer

unread,
Mar 22, 2010, 4:33:07 PM3/22/10
to
I have a NUMA machine, an AMD dual-chip dual-core (4-core) system running Win32 (Vista),
so if you need some tests run, email the code to me.

Remember when he wanted his data allocated in CONTIGUOUS PHYSICAL memory?  He is really
clueless about how operating systems work, but won't listen to ANYONE whose ideas don't
match his preconceived notions about how the world should work to maximize his
convenience.  Even if what we're trying to do is explain how reality works.

I wonder if he knows what TLB thrashing is?
joe
*****

Peter Olcott

unread,
Mar 22, 2010, 4:47:46 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:rnjfq5ls8fpma0kvr...@4ax.com...

I do have data, and have presented this data many times, and you two
simply blow it off.
Two processes take 2.75 times as long as one process. What
could this mean besides resource contention?

You tell me all about page faults, yet the process monitor
reports zero page faults, and you continue to claim that it's
all about page faults and virtual memory. Page faults
indicate virtual memory usage, right? A lack of page faults
indicates a lack of virtual memory usage, right?

Peter Olcott

unread,
Mar 22, 2010, 4:52:54 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:ecdfq5lb57qrou47d...@4ax.com...

http://en.wikipedia.org/wiki/Memory-mapped_file
Apparently I do.

Pete Delgado

unread,
Mar 22, 2010, 4:53:51 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:9qkfq51noaiib6eva...@4ax.com...

>I have a NUMA machine, and AMD dual-chip dual-core (4-core) system running
>WIn32 (Vista),
> so if you need some tests run, email the code to me.
>
> Remember when he wanted his data allocating in CONTIGUOUS PHYSICAL memory?
> He is really
> clueless about how operating systems work, but won't listen to ANYONE
> whose ideas don't
> match his preconceived notions about how the world should work to maximize
> his
> convenience. EVen if what we're trying to do is explain how reality
> works.

Joe,
The amazing thing to me is the incredible amount of patience you have had
with Mr. Olcott. You have referenced Richter's book, which would explain
everything he needs to know in a few short pages, and yet he has resisted.
I've been reading this thread with a combination of amusement and
bewilderment. I guess the old adage that "you can lead a horse to water, but
you cannot make him drink" has never been more appropriate. :-)

-Pete

Pete Delgado

unread,
Mar 22, 2010, 5:10:13 PM3/22/10
to

"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
news:osWdnaGZ3q06RTrW...@giganews.com...

>
> "Joseph M. Newcomer" <newc...@flounder.com> wrote in message
> news:ecdfq5lb57qrou47d...@4ax.com...
>> ****
>> He has NO CLUE as to what a "memory-mapped file" actually is. This last
>> comment indicates
>
> http://en.wikipedia.org/wiki/Memory-mapped_file
> Apparently I do.

I think you would be far better served by looking at Windows specific
information on memory mapped files such as that which Joe suggested to you
some time ago: Richter's Programming Applications for Microsoft Windows 4th.

-Pete


Joseph M. Newcomer

unread,
Mar 22, 2010, 5:13:35 PM3/22/10
to
Or, as we used to say when I was University faculty:

"You can lead a student to knowledge, but you can't make him think"

I got a lot of flak when I gave an exam that required students to USE the knowledge I'd
given them for the last two weeks, plus the knowledge they would have gained from doing
the homework assignment.  Because they couldn't do a "memory dump" from my PowerPoint
slides, they thought the test "unfair".
joe

Joseph M. Newcomer

unread,
Mar 22, 2010, 5:21:25 PM3/22/10
to
No, you are the one stuck in "refute mode". I keep insisting that the ONLY way you can
refute what we are saying is by running the actual experiment, and you keep saying, no,
you KNOW that it will fail. Without data, you have no way to validly assert this.
joe

Joseph M. Newcomer

unread,
Mar 22, 2010, 5:24:56 PM3/22/10
to
See below...

***
And what does a two-process measurement tell you about multithreading? NOTHING!

You have a flawed experiment that is generalizing from the wrong premises.

OF COURSE two huge processes are going to have resource contention!  And are these running
on a single-core or multicore machine?  But multiple threads have significantly lower
resource contention, and you ignore this fact and think your two-process model is the
absolute authoritative experiment.  I'd never trust data like this, and if I were a
product manager I'd send you back to get real data.  You are your own product manager, and
should recognize (given how much we've told you) that you should send yourself back to get
valid data.
joe

Peter Olcott

unread,
Mar 22, 2010, 5:27:48 PM3/22/10
to

"Pete Delgado" <Peter....@NoSpam.com> wrote in message
news:OIehRQgy...@TK2MSFTNGP04.phx.gbl...

Joe kept insisting and continues to insist that my data is
not resident in memory.

After loading my data and waiting twelve hours, the process
monitor reports zero page faults when I execute my process
and run it to completion.

How does this not prove Joe is wrong (at least in the
specific instance of one execution of my process), unless:
(1) the process monitor is lying, or
(2) page faults do not measure virtual memory usage?


Peter Olcott

unread,
Mar 22, 2010, 5:32:17 PM3/22/10
to
You tell me all about page faults, yet the process monitor
reports zero page faults, and you continue to claim that it's
all about page faults and virtual memory. Page faults
indicate virtual memory usage, right? A lack of page faults
indicates a lack of virtual memory usage, right?

How does this not prove that my data is in memory and thus
you are wrong when you say that my data is not resident in
memory?

Load data,
wait twelve hours,
check page faults reported,
execute process to completion:
same number of page faults as before the 12 hours,
therefore data remained resident in memory.

"Joseph M. Newcomer" <newc...@flounder.com> wrote in

message news:arnfq5h441qhdl7hn...@4ax.com...

Joseph M. Newcomer

unread,
Mar 22, 2010, 5:32:03 PM3/22/10
to
See below...

***
And obviously you do not.  You read one general Wikipedia entry and failed to notice that it
said it improves I/O performance, and you have thought that this general article is
definitive.

OK, since you know all the answers, explain this: if I do CreateFileMapping and name a
mapping object, and use the same name in two processes, what happens to my virtual memory
maps when I do MapViewOfFile naming that mapping object handle? Please limit your
response to fewer than 100 words, which should be enough to explain the implications.

Note if you knew the correct answer, you would not have made the ridiculous statements you
have made about memory mapped files. And you would have known that they will improve
multiprocess performance.

I repeat my assertion: you are clueless.  Anyone who tries to use a generic Wikipedia
article as proof of being not clueless is metaclueless.
joe
****

Joseph M. Newcomer

unread,
Mar 22, 2010, 5:33:55 PM3/22/10
to
see below...

On Mon, 22 Mar 2010 12:15:27 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Hector Santos" <sant...@nospam.gmail.com> wrote in message

>news:uGVpjmdy...@TK2MSFTNGP06.phx.gbl...


>> Peter Olcott wrote:
>>
>>> Since my process (currently) requires unpredictable
>>> access to far more memory than can fit into the largest
>>> cache, I see no possible way that adding 1000-fold slower
>>> disk access could possibly speed things up. This seems
>>> absurd to me.
>>
>>

>> And I would agree it would seem to be absurd to
>> inexperienced people.
>>
>> But you need to TRUST the power of your multi-processor
>> computer because YOU are most definitely under utilizing
>> it by a long shot.
>>
>> The code I posted is the proof!
>
>If it requires essentially nothing besides random access to
>entirely different places of 100 MB of memory, then (and
>only then) would it be reasonably representative of my
>process. Nearly all my process does is look up in memory
>the next place to look up in memory.
***
Then the CORRECT approach is not to say "I don't want to waste time reading your example
because it doesn't match my problem"; the CORRECT approach is to say "I have now modified
your example to more closely resemble my access patterns, ran it, and got THIS result" and
show the results you got.
joe

>
>>
>> Your issue is akin to having a pickup truck, overloading
>> the back, piling things on each other, overweight beyond
>> the recommended safety levels per specifications of the
>> car manufacturer (and city/state ordinances), and now your
>> driving, speed, vision of your truck are all altered.
>> Your truck won't go as fast now and if even if you could,
>> things can fall, people can die, crashes can happen.
>>
>> You have two choices:
>>
>> - You can stop and unload stuff and come back and pick
>> it up on
>> 2nd trip, your total travel time doubled.
>>
>> - you can get a 2nd pickup truck, split the load and
>> get
>> on a four-lane highway and drive side by side;
>> sometimes
>> one creeps ahead, and the other moves ahead, and
>> both reach
>> the destination at near the same expected time.
>>
>> Same thing!
>>
>> You are overloading your machine to the point that it is working
>> very very hard to satisfy your single-thread process
>> needs. You may "believe" it is working at optimal speeds
>> because it has uninterrupted exclusive access but it is
>> not reality. You are under utilizing the power of your
>> machine.
>>
>> Whether you realize it or not, the overloaded pickup truck
>> is smart and is stopping you every X milliseconds checking
>> if you have a 2nd pickup truck to offload some work and do
>> some moving for you!!
>>
>> You need to change your thinking.
>>
>> However, at this point, I don't think you have any coding
>> skills, because if you did, you would be EAGERLY JUMPING
>> at the code I provided to see for yourself.
>>
>> --
>> HLS

Joseph M. Newcomer

unread,
Mar 22, 2010, 5:36:45 PM3/22/10
to
See below...

I already pointed out that locality of reference is not the real criterion, just the one
that we often use because it is easiest to analyze and control.  And while locality of
reference maximizes cache hits, it is not the ONLY technique that could maximize cache
hits.  You have fastened on this heuristic and think of it as a hard rule.
joe
****

Peter Olcott

unread,
Mar 22, 2010, 5:38:14 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:42ofq5h3d80a2pmmk...@4ax.com...

Apparently within the context of virtual memory usage, **
which I have shown in at least one instance does not apply.
Zero page faults indicate zero virtual memory usage, right?

** Do virtual memory maps apply to anything else besides
virtual memory?

>
> I repeat my assertion: you are clueless. Anyone who
> tries to use a generic wikipedia
> article as proof of being not clueless is metaclueless.
> joe
> ***
>>

Peter Olcott

unread,
Mar 22, 2010, 5:44:49 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:ifofq5hdvoesvnv8a...@4ax.com...

I think that you must have paraphrased me incorrectly and I
don't see the context in this post. I would not likely have
concluded that it does not match my problem, and in fact
remember asking Hector if it did match my problem. I stated
my problem much more concisely than he stated his proof.

The single sentence is above. Hector never told me whether
or not he thought that it matched my algorithm's memory
usage pattern.

Peter Olcott

unread,
Mar 22, 2010, 5:49:54 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:qiofq5hp1tne6ref0...@4ax.com...

If there is no spatial or temporal locality of reference and
the memory usage pattern is totally unpredictable, and the
memory usage is far greater than can possibly fit into the
largest cache, what is left that cache can do?

If all of the types of cache can be categorically denied,
then all of the complexity of cache becomes moot when
performance is determined.

Hector Santos

unread,
Mar 22, 2010, 5:49:29 PM3/22/10
to
Peter Olcott wrote:

> Joe kept insisting and continues to insist that my data is
> not resident in memory.


If you have a 32 bit Windows OS, you are limited to just 2GB RAW
ACCESS and 4GB of VIRTUAL MEMORY.

If your process is loading 4GB, you are using virtual memory.

> After loading my data and waiting twelve hours the process
> monitor reports zero page faults, when I execute my process
> and run it to completion.


You're lying; you told me you have PAGE FAULTS but they settled down to
zero, which is NORMAL. But start a 2nd process and you will get page
faults.

I have also asked, now 5 times, for you to provide the MEMORY LOAD percentage -
I even provided a simple C program that you can compile - and you
did not:

// File: V:\bin\memload.cpp

#include <stdio.h>
#include <windows.h>

int main(int argc, char *argv[])
{
    MEMORYSTATUS ms;
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatus(&ms);   // system-wide memory load, 0-100
    printf("Memory Load: %u%%\n", (unsigned)ms.dwMemoryLoad);
    return 0;
}

Why can't you even do that?

> How does this not prove Joe is wrong (At least in the
> specific instance of one execution of my process)?
> (1) The process monitor is lying.
> (2) Page faults do not measure virtual memory usage.

There are now some 4-5 participants in the thread who are telling you
your thinking is wrong and lacks an understanding of the Windows and Intel
hardware.

Let's get a few more, like this guy with a somewhat layman's description:

http://blogs.sepago.de/helge/2008/01/09/windows-x64-all-the-same-yet-very-different-part-1/

and the #1 guy at Microsoft today!

http://blogs.technet.com/markrussinovich/archive/2008/07/21/3092070.aspx

If you DEFY what Mark Russinovich is saying here, you are CRAZY!

--
HLS

Peter Olcott

unread,
Mar 22, 2010, 5:59:34 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:%23F2oLmg...@TK2MSFTNGP06.phx.gbl...

> Peter Olcott wrote:
>
>> Joe kept insisting and continues to insist that my data
>> is not resident in memory.
>
>
> If you have a 32 bit Windows OS, you are limited to just
> 2GB RAW ACCESS and 4GB of VIRTUAL MEMORY.

Yes, and that is another thing. I kept saying that I have a
64-bit OS, and Joe kept forming his replies in terms of a
32-bit OS.

>
> If your process is loading 4GB, you are using virtual
> memory.
>
>> After loading my data and waiting twelve hours the
>> process monitor reports zero page faults, when I execute
>> my process and run it to completion.
>
>
> You're lying; you told me you have PAGE FAULTS but they
> settled down to zero, which is NORMAL. But start a 2nd
> process and you will get page faults.

I only get the page faults until the data is loaded. After
the data is loaded I get essentially no more page faults,
even after waiting twelve hours before running my process to
completion. After proving that my data is resident in RAM
Joe continues to chide me for claiming that my data is
resident in RAM.

Are you guys just playing head games with me?

Hector Santos

unread,
Mar 22, 2010, 6:02:08 PM3/22/10
to
Peter Olcott wrote:

> You tell me all about page faults, yet the process monitor
> reports zero page faults, and you continue to claim that it's
> all about page faults and virtual memory.


It's not a claim - it's a fact.

> Page faults indicate virtual memory usage, right?


It shows when your PROCESS is asking for more than can be provided to it
all in memory - it has to virtualize it.

> A lack of page faults indicates a lack of virtual memory usage right?


No. If it's zero or not changing - and I know your process is not - it
means that your process working set is not demanding more than it can
handle, or OTHER processes have not chewed up memory, limiting your
available memory.

You have NO control over this UNLESS you explicitly told Windows to
use NO-CACHING, NO-BUFFERING I/O memory.

Why do you refuse to believe this? Every Windows programmer has to
know this to some degree.

> How does this not prove that my data is in memory and thus
> you are wrong when you say that my data is not resident in
> memory?


It doesn't.

Allow a Technical Fellow at Microsoft to explain it all to you:

Pushing the Limits of Windows: Physical Memory
http://blogs.technet.com/markrussinovich/archive/2008/07/21/3092070.aspx

Pushing the Limits of Windows: Virtual Memory
http://blogs.technet.com/markrussinovich/archive/2008/11/17/3155406.aspx

Pushing the Limits of Windows: Paged and Nonpaged Pool
http://blogs.technet.com/markrussinovich/archive/2009/03/26/3211216.aspx

Are you going to argue with Mark too?

--
HLS

Hector Santos

unread,
Mar 22, 2010, 6:13:57 PM3/22/10
to
Peter Olcott wrote:

> I think that you must have paraphrased me incorrectly and I
> don't see the context in this post. I would not likely have
> concluded that it does not match my problem, and in fact
> remember asking Hector if it did match my problem. I stated
> my problem much more concisely than he stated his proof.
>
> The single sentence is above. Hector never told me whether
> or not he thought that it matched my algorithm's memory
> usage pattern.

I sure did - you replied to the message, but decided to ignore
the details.

The simulator provided the WORST CASE scenario of a serialized reading
of your entire payload with single and multiple threads, with NO other
overhead. This is the highest pressure you can get for X number of
threads reading the same memory. It proved your understanding of the
memory I/O, caching, and multi-threaded nature of a 32-bit preemptive
operating system is extremely primitive.

I also provided the scenario of random, unpredictable reading with
single and multiple threads and proved how it was even faster because
you are not reading the entire payload but parts of it randomly.

Until you run the simulator, you will never grasp how fundamentally
lacking your thinking is about Windows.

--
HLS

Hector Santos

unread,
Mar 22, 2010, 6:21:44 PM3/22/10
to

>> If you have a 32 bit Windows OS, you are limited to just

>> 2GB RAW ACCESS and 4GB of VIRTUAL MEMORY.
>
> Yes, and that is another thing. I kept saying that I have a
> 64bit OS, and Joe kept forming his replies in terms of a
> 32-bit OS.


"That's another thing" - oh, stop it. You did not keep saying that; in
fact, I only recall ONCE where you said Windows 7 32bit QUAD 8GB machine.

Did you compile your code for 64-bit or 32-bit? Are you fully 100%
sure that all your VARIABLES and all the I/O with your variables are 64-bit?

Forgive me if I am wrong, but you have shown no programming or
engineering tenacity whatsoever to indicate you know how to a) program
or b) convert other people's code (which I believe is all you have) to
deal with 32-bit programming, let alone 64-bit programming.

--
HLS

Hector Santos

unread,
Mar 22, 2010, 6:23:44 PM3/22/10
to

Pete Delgado wrote:

> "Peter Olcott" <NoS...@OCR4Screen.com> wrote in message

>>> He has NO CLUE as to what a "memory-mapped file" actually is. This last
>>> comment indicates
>> http://en.wikipedia.org/wiki/Memory-mapped_file
>> Apparently I do.
>
> I think you would be far better served by looking at Windows specific
> information on memory mapped files such as that which Joe suggested to you
> some time ago: Richter's Programming Applications for Microsoft Windows 4th.

And I suggested it waaaaaaaay back in the beginning of this thread. :) I
even gave him a link for a sweet CMemoryMapFile class at MSDN!

--
HLS

Peter Olcott

unread,
Mar 22, 2010, 7:46:24 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:exGTQtgy...@TK2MSFTNGP06.phx.gbl...

> Peter Olcott wrote:
>
>> You tell me all about pages faults, yet the process
>> monitor
>> reports zero page faults, and you continue to claim that
>> its
>> all about page faults, and virtual memory.
>
>
> Its not a claim - its a fact.
>
>> Pages faults indicate victual memory usage right?
>
>
> It shows when your PROCESS is asking too much the can
> provide to you all in memory - it has to virtualize it.
>
>> A lack of page faults indicates a lack of virtual memory
>> usage right?
>
>
> No. If its zero or not changing and I know your process is
> not, it means that your process working set is not
> demanding more than it can handle or OTHER processes have
> not chewed up memory, limiting your available memory.

OK so zero page faults does not mean that virtual memory is
not being used?
(1) YES zero page faults means that virtual memory is not
active on this process
(2) Not (YES zero page faults means that virtual memory is
not active on this process)

Which is it, (1) or (2)? Any hemming and hawing will be taken
as intentional deceit.

Peter Olcott

unread,
Mar 22, 2010, 7:50:33 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:ufEXU5gy...@TK2MSFTNGP02.phx.gbl...

I have proven that this is moot, and this proof continues to
be ignored. I know you guys must be just messing with me
because there are guys that are not just messing with me on
several other groups. They can prove that they know what
they are talking about by explaining how the underlying
details fit together.


Joseph M. Newcomer

unread,
Mar 22, 2010, 7:59:55 PM3/22/10
to
See below...

On Mon, 22 Mar 2010 16:27:48 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Pete Delgado" <Peter....@NoSpam.com> wrote in message
>news:OIehRQgy...@TK2MSFTNGP04.phx.gbl...
>>
>> "Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
>> news:osWdnaGZ3q06RTrW...@giganews.com...
>>>
>>> "Joseph M. Newcomer" <newc...@flounder.com> wrote in
>>> message
>>> news:ecdfq5lb57qrou47d...@4ax.com...
>>>> ****
>>>> He has NO CLUE as to what a "memory-mapped file"
>>>> actually is. This last comment indicates
>>>
>>> http://en.wikipedia.org/wiki/Memory-mapped_file
>>> Apparently I do.
>>
>> I think you would be far better served by looking at
>> Windows specific information on memory mapped files such
>> as that which Joe suggested to you some time ago:
>> Richter's Programming Applications for Microsoft Windows
>> 4th.
>>
>> -Pete
>>
>>
>
>Joe kept insisting and continues to insist that my data is
>not resident in memory.

***
I do not recall asserting that; I pointed out that Windows preemptively pages out unused pages and
marks the slots for reuse, but that is not what you claim I stated.
****


>
>After loading my data and waiting twelve hours the process
>monitor reports zero page faults, when I execute my process
>and run it to completion.

***
That is useful data, but has nothing to do with the multithreading question.  It only
demonstrates that a tiny number of pages had been moved out (my recollection is you said 5
page faults, not zero).

****


>
>How does this not prove Joe is wrong (At least in the
>specific instance of one execution of my process)?
>(1) The process monitor is lying.
>(2) Page faults do not measure virtual memory usage.
>

****
It says nothing about one execution; it says that under certain conditions, paging is not
an issue. It does not say anything about using multiple threads on multiple cores within
a single process.

You seem to think it does.
joe
****

Peter Olcott

unread,
Mar 22, 2010, 8:11:14 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:4u0gq5t7htjkq7pae...@4ax.com...

This is the earlier issue where you claimed that my thinking
that I needed to have my data resident in RAM was absurd and
based on ignorance. I do need to have my data resident in
RAM and indeed my data is resident in RAM for extended
periods, and there is no ignorance associated with this
thinking.

Joseph M. Newcomer

unread,
Mar 22, 2010, 8:19:20 PM3/22/10
to
See below...

On Mon, 22 Mar 2010 18:46:24 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Hector Santos" <sant...@nospam.gmail.com> wrote in message
>news:exGTQtgy...@TK2MSFTNGP06.phx.gbl...
>> Peter Olcott wrote:
>>
>>> You tell me all about page faults, yet the process
>>> monitor
>>> reports zero page faults, and you continue to claim that
>>> it's
>>> all about page faults, and virtual memory.
>>
>>
>> Its not a claim - its a fact.
>>
>>> Page faults indicate virtual memory usage, right?
>>
>>
>> It shows when your PROCESS is asking for more than the
>> system can provide all in memory - it has to virtualize it.
>>
>>> A lack of page faults indicates a lack of virtual memory
>>> usage right?
>>
>>
>> No. If its zero or not changing and I know your process is
>> not, it means that your process working set is not
>> demanding more than it can handle or OTHER processes have
>> not chewed up memory, limiting your available memory.
>
>OK so zero page faults does not mean that virtual memory is
>not being used?

***
OF COURSE virtual memory is being used; there is NO OTHER KIND OF MEMORY for a process.
What it means is that all of the virtual memory has remained resident, something that
was not demonstrable before you published these results. You have
demonstrated that it is not being paged out and the pages reused, at least under your test
scenario.
***


>(1) YES zero page faults means that virtual memory is not
>active on this process

****
But given that there is only virtual memory, you cannot assert that zero page faults mean
it is not being used, only that the virtual pages have remained in memory, a useful piece
of knowledge. And if you had a clue about memory-mapped files, this would tell you that
using a named, shared segment would improve performance of multiple processes using MMF
to get the data in, and it might also mean you wouldn't see the several-minute startup
transient. You certainly wouldn't see it on the second or later processes.

Perhaps you can explain what you mean by "virtual memory is not active". Alas, for your
way of forming the question, virtual memory is ALWAYS active, and that implies the
potential for TLB thrashing. Do you know what a TLB is, the role it serves, and how it is
managed? Or what TLB thrashing might be? To simplify the task, I will tell you that TLB
stands for Translation Lookaside Buffer. The rest is up to you. If you didn't know what
a TLB is when you asked the question, it proves that you had no clue about why the
question as stated is nonsense.
****


>(2) Not (YES zero page faults means that virtual memory is
>not active on this process)

****
Virtual memory HAS TO be "active" because there isn't any other kind of memory available
to a process. This is inherent in every operating system.
****


>
>Which is it (1) or (2) ??? Any hem hawing will be taken
>as intentional deceit

****
The question is ill-formed because the terms are being used incorrectly and in some cases
in ways that suggest that the answer could be "there is no virtual memory being used",
which is a nonsensical statement. There is ONLY virtual memory being used; it just
happens that it is not paged out, or the pageouts retain their in-memory images so any page
faults are "soft" (meaning the data does not have to be read off the disk, because it can
be found already in memory, in an un-reused page).

You can't hem or haw at an ill-formed question, whether the goal is or is not to be
deceitful. The question makes no sense as stated and the two alternative answers are both
nonsensical. It is almost but not quite as bad as the "have you stopped beating your
wife?" style of question, for which any answer that is "yes" or "no" is damning. So I will
state that the question is ill-formed and the two options presented as answers are
nonsensical, and that is the complete TRUTH. There is no reason to be deceitful here; you
asked a question for which the correct answer is:

"This demonstrates that Windows retains pages in memory when there is no need to page them
out and reuse the page frames" and that is the ONLY truthful and correct answer.

Unfortunately, it is not either of the nonsensical alternatives you allow. If you don't
understand why the question is nonsensical as stated and both alternative answers are
nonsensical, you are demonstrating that you really, truly are clueless about how operating
systems work.
****


>
>>
>> You have NO control over this UNLESS you explicitly told
>> windows to use NO-CACHING, NO BUFFER I/O memory.
>>
>> Why do you refuse to believe this? Every Windows
>> programmer has to know this to some degree.
>>
>>> How does this not prove that my data is in memory and
>>> thus you are wrong when you say that my data is not
>>> resident in memory?
>>
>>
>> It doesn't.
>>
>> Allow a Technical Fellow Engineer at Microsoft explain it
>> all to you:
>>
>> Pushing the Limits of Windows: Physical Memory
>> http://blogs.technet.com/markrussinovich/archive/2008/07/21/3092070.aspx
>>
>> Pushing the Limits of Windows: Virtual Memory
>> http://blogs.technet.com/markrussinovich/archive/2008/11/17/3155406.aspx
>>
>> Pushing the Limits of Windows: Paged and Nonpaged Pool
>> http://blogs.technet.com/markrussinovich/archive/2009/03/26/3211216.aspx
>>
>> Are you going to argue with Mark too?
>>
>> --
>> HLS
>

Hector Santos

unread,
Mar 22, 2010, 8:18:11 PM3/22/10
to
Peter Olcott wrote:

> "Hector Santos" <sant...@nospam.gmail.com> wrote in message
> news:exGTQtgy...@TK2MSFTNGP06.phx.gbl...
>> Peter Olcott wrote:


>> No. If its zero or not changing and I know your process is
>> not, it means that your process working set is not
>> demanding more than it can handle or OTHER processes have
>> not chewed up memory, limiting your available memory.
>
> OK so zero page faults does not mean that virtual memory is
> not being used?
> (1) YES zero page faults means that virtual memory is not
> active on this process
> (2) Not (YES zero page faults means that virtual memory is
> not active on this process)
>
> Which is it (1) or (2) ??? Any hem hawing will be taken
> as intentional deceit


None of the above:

Your process AT THAT MOMENT does not need to PAGE anything in because
it was already in your WORKING SET.

Look, YOUR SIMPLE PROGRAM IS ALWAYS USING VIRTUALIZED MEMORY! ALWAYS!

Please answer these questions:


Did you try the memory load program?


// File: V:\bin\memload.cpp

#include <stdio.h>
#include <windows.h>

int main(void)
{
MEMORYSTATUS ms;
ms.dwLength = sizeof(ms);
GlobalMemoryStatus(&ms);
printf("Memory Load: %d%%\n", (int)ms.dwMemoryLoad);
return 0;
}

Did you compile your application to use 64BIT? Did you convert all
your 32 BIT variables to 64BIT?

Did you read the expert Mark R?

Pushing the Limits of Windows: Paged and Nonpaged Pool
http://blogs.technet.com/markrussinovich/archive/2009/03/26/3211216.aspx

--
HLS

Hector Santos

unread,
Mar 22, 2010, 8:20:41 PM3/22/10
to
Peter Olcott wrote:

>> And I suggested waaaaaaaay back in the beginning of this
>> thread. :) I even gave him a link for a sweet
>> CMemoryMapFile class at MSDN!
>

> I have proven that this is moot, and this proof continues to
> be ignored.


What proof? You provided no proof of anything whatsoever, not even
the existence of your OCR program.

--
HLS

Joseph M. Newcomer

unread,
Mar 22, 2010, 8:32:33 PM3/22/10
to
See below...

On Mon, 22 Mar 2010 06:28:36 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Hector Santos" <sant...@nospam.gmail.com> wrote in message

>news:%23IrS0xY...@TK2MSFTNGP04.phx.gbl...
>> Here is the result using a 1.5GB readonly memory mapped
>> file. I started with 1 single process thread, then switch
>> to 2 threads, then 4, 6, 8, 10 and 12 threads. Notice how
>> the processing time for the earlier threads started high
>> but decreased with the later thread. This was the caching
>> effect of the readonly memory file. Also note the Global
>> Memory Status *MEMORY LOAD* percentage. For my machine, it
>> is at 19% at steady state. But as expected it shoots up
>> when dealing with this large memory map file. I probably
>> can fine tune the map views better, but they are set as
>> read only. Well, I'll leave OP to figure out memory maps
>> coding for his patented DFA meta file process.
>>
>> V:\wc5beta>testpeter3t /s:3000000 /r:1
>> - size : 3000000
>> - memory : 1536000000 (1500000K)
>> - repeat : 1
>> - Memory Load : 25%
>> - Allocating Data .... 0
>> ---------------------------------------
>> Time: 2984 | Elapsed: 0
>> ---------------------------------------
>> Total Client Time: 2984
>>
>> V:\wc5beta>testpeter3t /s:3000000 /t:2 /r:1
>> - size : 3000000
>> - memory : 1536000000 (1500000K)
>> - repeat : 1
>> - Memory Load : 25%
>> - Allocating Data .... 0
>> * Starting threads
>> - Creating thread 0
>> - Creating thread 1
>> * Resuming threads
>> - Resuming thread# 0 in 41 msecs.
>> - Resuming thread# 1 in 467 msecs.
>> * Wait For Thread Completion
>> - Memory Load: 96%
>> * Done
>> ---------------------------------------
>> 0 | Time: 5407 | Elapsed: 0
>> 1 | Time: 4938 | Elapsed: 0
>> ---------------------------------------
>> Total Time: 10345
>>
>> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:4
>> - size : 3000000
>> - memory : 1536000000 (1500000K)
>> - repeat : 1
>> - Memory Load : 25%
>> - Allocating Data .... 0
>> * Starting threads
>> - Creating thread 0
>> - Creating thread 1
>> - Creating thread 2
>> - Creating thread 3
>> * Resuming threads
>> - Resuming thread# 0 in 41 msecs.
>> - Resuming thread# 1 in 467 msecs.
>> - Resuming thread# 2 in 334 msecs.
>> - Resuming thread# 3 in 500 msecs.
>> * Wait For Thread Completion
>> - Memory Load: 97%
>> * Done
>> ---------------------------------------
>> 0 | Time: 6313 | Elapsed: 0
>> 1 | Time: 5844 | Elapsed: 0
>> 2 | Time: 5500 | Elapsed: 0
>> 3 | Time: 5000 | Elapsed: 0
>> ---------------------------------------
>> Total Time: 22657
>>
>> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:6
>> - size : 3000000
>> - memory : 1536000000 (1500000K)
>> - repeat : 1
>> - Memory Load : 25%
>> - Allocating Data .... 0
>> * Starting threads
>> - Creating thread 0
>> - Creating thread 1
>> - Creating thread 2
>> - Creating thread 3
>> - Creating thread 4
>> - Creating thread 5
>> * Resuming threads
>> - Resuming thread# 0 in 41 msecs.
>> - Resuming thread# 1 in 467 msecs.
>> - Resuming thread# 2 in 334 msecs.
>> - Resuming thread# 3 in 500 msecs.
>> - Resuming thread# 4 in 169 msecs.
>> - Resuming thread# 5 in 724 msecs.
>> * Wait For Thread Completion
>> - Memory Load: 97%
>> * Done
>> ---------------------------------------
>> 0 | Time: 6359 | Elapsed: 0
>> 1 | Time: 5891 | Elapsed: 0
>> 2 | Time: 5547 | Elapsed: 0
>> 3 | Time: 5047 | Elapsed: 0
>> 4 | Time: 4875 | Elapsed: 0
>> 5 | Time: 4141 | Elapsed: 0
>> ---------------------------------------
>> Total Time: 31860
>>
>> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:8
>> - size : 3000000
>> - memory : 1536000000 (1500000K)
>> - repeat : 1
>> - Memory Load : 25%
>> - Allocating Data .... 16
>> * Starting threads
>> - Creating thread 0
>> - Creating thread 1
>> - Creating thread 2
>> - Creating thread 3
>> - Creating thread 4
>> - Creating thread 5
>> - Creating thread 6
>> - Creating thread 7
>> * Resuming threads
>> - Resuming thread# 0 in 41 msecs.
>> - Resuming thread# 1 in 467 msecs.
>> - Resuming thread# 2 in 334 msecs.
>> - Resuming thread# 3 in 500 msecs.
>> - Resuming thread# 4 in 169 msecs.
>> - Resuming thread# 5 in 724 msecs.
>> - Resuming thread# 6 in 478 msecs.
>> - Resuming thread# 7 in 358 msecs.
>> * Wait For Thread Completion
>> - Memory Load: 96%
>> * Done
>> ---------------------------------------
>> 0 | Time: 6203 | Elapsed: 0
>> 1 | Time: 5734 | Elapsed: 0
>> 2 | Time: 5391 | Elapsed: 0
>> 3 | Time: 4891 | Elapsed: 0
>> 4 | Time: 4719 | Elapsed: 0
>> 5 | Time: 3984 | Elapsed: 0
>> 6 | Time: 3500 | Elapsed: 0
>> 7 | Time: 3125 | Elapsed: 0
>> ---------------------------------------
>> Total Time: 37547
>>
>> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:10
>> - size : 3000000
>> - memory : 1536000000 (1500000K)
>> - repeat : 1
>> - Memory Load : 25%
>> - Allocating Data .... 0
>> * Starting threads
>> - Creating thread 0
>> - Creating thread 1
>> - Creating thread 2
>> - Creating thread 3
>> - Creating thread 4
>> - Creating thread 5
>> - Creating thread 6
>> - Creating thread 7
>> - Creating thread 8
>> - Creating thread 9
>> * Resuming threads
>> - Resuming thread# 0 in 41 msecs.
>> - Resuming thread# 1 in 467 msecs.
>> - Resuming thread# 2 in 334 msecs.
>> - Resuming thread# 3 in 500 msecs.
>> - Resuming thread# 4 in 169 msecs.
>> - Resuming thread# 5 in 724 msecs.
>> - Resuming thread# 6 in 478 msecs.
>> - Resuming thread# 7 in 358 msecs.
>> - Resuming thread# 8 in 962 msecs.
>> - Resuming thread# 9 in 464 msecs.
>> * Wait For Thread Completion
>> - Memory Load: 97%
>> * Done
>> ---------------------------------------
>> 0 | Time: 7234 | Elapsed: 0
>> 1 | Time: 6766 | Elapsed: 0
>> 2 | Time: 6422 | Elapsed: 0
>> 3 | Time: 5922 | Elapsed: 0
>> 4 | Time: 5750 | Elapsed: 0
>> 5 | Time: 5016 | Elapsed: 0
>> 6 | Time: 4531 | Elapsed: 0
>> 7 | Time: 4125 | Elapsed: 0
>> 8 | Time: 3203 | Elapsed: 0
>> 9 | Time: 2703 | Elapsed: 0
>> ---------------------------------------
>> Total Time: 51672
>>
>> V:\wc5beta>testpeter3t /s:3000000 /r:1 /t:12
>> - size : 3000000
>> - memory : 1536000000 (1500000K)
>> - repeat : 1
>> - Memory Load : 25%
>> - Allocating Data .... 16
>> * Starting threads
>> - Creating thread 0
>> - Creating thread 1
>> - Creating thread 2
>> - Creating thread 3
>> - Creating thread 4
>> - Creating thread 5
>> - Creating thread 6
>> - Creating thread 7
>> - Creating thread 8
>> - Creating thread 9
>> - Creating thread 10
>> - Creating thread 11
>> * Resuming threads
>> - Resuming thread# 0 in 41 msecs.
>> - Resuming thread# 1 in 467 msecs.
>> - Resuming thread# 2 in 334 msecs.
>> - Resuming thread# 3 in 500 msecs.
>> - Resuming thread# 4 in 169 msecs.
>> - Resuming thread# 5 in 724 msecs.
>> - Resuming thread# 6 in 478 msecs.
>> - Resuming thread# 7 in 358 msecs.
>> - Resuming thread# 8 in 962 msecs.
>> - Resuming thread# 9 in 464 msecs.
>> - Resuming thread# 10 in 705 msecs.
>> - Resuming thread# 11 in 145 msecs.
>> * Wait For Thread Completion
>> - Memory Load: 97%
>> * Done
>> ---------------------------------------
>> 0 | Time: 7984 | Elapsed: 0
>> 1 | Time: 7515 | Elapsed: 0
>> 2 | Time: 7188 | Elapsed: 0
>> 3 | Time: 6672 | Elapsed: 0
>> 4 | Time: 6500 | Elapsed: 0
>> 5 | Time: 5781 | Elapsed: 0
>> 6 | Time: 5250 | Elapsed: 0
>> 7 | Time: 4953 | Elapsed: 0
>> 8 | Time: 3953 | Elapsed: 0
>> 9 | Time: 3484 | Elapsed: 0
>> 10 | Time: 2750 | Elapsed: 0
>> 11 | Time: 2547 | Elapsed: 0
>> ---------------------------------------
>> Total Time: 64577
>>
>>
>> --
>> HLS
>
>OK and where is the summary conclusion?
>Also by using a memory mapped file your process would have
>entirely different behavior than mine.
>
>I known that it is possible that you could have been right
>all along about this, and I could be wrong. I know this
>because of a term that I coined. [Ignorance Squared].
>
>[Ignorance Squared] is the process by which a lack of
>understanding is perceived by the one who lacks this
>understanding as disagreement. Whereas the one who has
>understanding knows that the ignorant person is lacking
>understanding, the ignorant person lacks this insight, and is
>thus ignorant even of their own ignorance, hence the term
>[Ignorance Squared].
***
It is your description. You have consistently shown a lack of understanding of operating
systems, by making nonsensical statements about how memory allocation works, how threading
works, and how virtual memory works. So to me, it sounds like this term you define is
being used in a self-descriptive fashion.

And all we are telling you is that you know so little of what is going on at every level
of storage management that your flat statements about performance have no basis, and that
you should run experiments to see if your guesses are correct or not, and you keep telling
us that by sheer guesswork you can arrive at a conclusion that highly experienced
performance people would never dare present without substantiating data. Yet you claim
you MUST be right. Hector and I pretty much claim that we want to see NUMBERS that prove
what is going on. Ever-so-slowly you produce one number or another, whereas I don't think
either of us would have made ANY statements without a WHOLE LOT MORE actual measurements
to prove or disprove our hypotheses. You keep saying what MUST be so, and in only ONE
case (the page fault example) have you actually gone out and gotten substantiating data to
prove you are right, that with an excess of RAM the page faults drop to zero.

My big objection is that you refuse to make measurements because you are so convinced of
your correctness that it doesn't occur to you that it is actually working to your
advantage to be wrong (if you're wrong, you are losing performance you might have gotten).
And your one flawed experiment, two massive processes on the same core, is not a valid
measure of anything except what happens if you run two massive processes on the same core.
You did not run two massive processes sharing a single memory segment, or multiple
processes on multiple cores, or multiple processes sharing a single memory segment on
multiple cores, or any of the other interesting variants that should be measured. Yet you
have an entire business strategy which is predicated on high performance, and you ignore
every suggestion that might lead to improved performance.

Only government economists extrapolate from a single data point.
joe
****
>
>Now that I have a way to empirically validate your theories
>against mine (that I dreamed up last night while sleeping) I
>will do this.

Peter Olcott

unread,
Mar 22, 2010, 8:39:29 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:%23pueS5h...@TK2MSFTNGP05.phx.gbl...

> Peter Olcott wrote:
>
>> "Hector Santos" <sant...@nospam.gmail.com> wrote in
>> message news:exGTQtgy...@TK2MSFTNGP06.phx.gbl...
>>> Peter Olcott wrote:
>
>
>>> No. If its zero or not changing and I know your process
>>> is not, it means that your process working set is not
>>> demanding more than it can handle or OTHER processes
>>> have not chewed up memory, limiting your available
>>> memory.
>>
>> OK so zero page faults does not mean that virtual memory
>> is not being used?
>> (1) YES zero page faults means that virtual memory is not
>> active on this process
>> (2) Not (YES zero page faults means that virtual memory
>> is not active on this process)
>>
>> Which is it (1) or (2) ??? Any hem hawing will be
>> taken as intentional deceit
>
>
> None of the above:
>
> Your process AT THAT MOMENT does not need to PAGE anything
> in because
> it was already in your WORKING SET.
>
> Look, YOUR SIMPLE PROGRAM IS ALWAYS USING VIRTUALIZED
> MEMORY! ALWAYS!

You just contradicted yourself, but there was no
intentional deceit.
(1) It is ALWAYS using virtual memory
(2) There was an instance where it was not using virtual
memory

>
> Please answer these questions:
>
>
> Did you try the memory load program?


I will go back and carefully study these threads at the
point in time where my strategy to always make sure that I
have twice as much RAM as the total load on the system
requires fails. I expect this to be never.

Hector Santos

unread,
Mar 22, 2010, 8:38:23 PM3/22/10
to

Joseph M. Newcomer wrote:


Joe, he changes his statements to suit whatever it is he is trying to
prove but cannot. I honestly don't think he knows anything about Microsoft
C/C++, MFC, or WIN32, and what he has as OCR compiled code was probably
taken from open source and COMPILED the first time. He has no
product, no demo, nothing to show he has anything, and even if he does
have software copied from a DFA C example code book, he doesn't know
anything about optimizing it.

All he knows is he needs 4GB and thought that getting 8GB would be good
enough to run at least two instances, each with redundant 4GB memory
allocations, with no loss in speed.

But he is finding out otherwise, that is how all this started.

So he is wondering, and found out the Lx chip caching didn't help. He
concluded that he has a special patented OCR application that
exhausted the physical capabilities of the computer, so special that it's
only possible to run it once.

Yet he refuses, and I can only presume it's because he doesn't know
how, to explore the simulator, and he doesn't grasp any of the technical
writeups and links provided regarding virtual memory and multi-threaded
operations.

It defies logic.

--
HLS

Peter Olcott

unread,
Mar 22, 2010, 8:45:31 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:661gq5l01bg7rf539...@4ax.com...

And of course you know that a second thread would work just
fine because you know that my process is not memory
bandwidth intensive.

>
> Perhaps you can explain whay you man by "virtual memory is
> not active". Alas, for your

If there are no pages going in and out of physical memory it
is active in the same way that a parked car is active. Maybe
you call it active if the engine is still running, even if
its not going anywhere.

> way of forming the question, virtual memory is ALWAYS
> active, and that implies the


Right and it is even active if it is disabled, because we
all know that it is ALWAYS active.

Hector Santos

unread,
Mar 22, 2010, 8:44:25 PM3/22/10
to
Peter Olcott wrote:

>> It says nothing about one execution; it says that under
>> certain conditions, paging is not
>> an issue. It does not say anything about using multiple
>> threads on multiple cores within
>> a single process.
>
> This is the earlier issue where you claimed that my thinking
> that I needed to have my data resident in RAM was absurd and
> based on ignorance. I do need to have my data resident in
> RAM and indeed my data is resident in RAM for extended
> periods, and there is no ignorance associated with this
> thinking.


But it is ignorance, because your PROCESS MEMORY is VIRTUAL MEMORY!

Look, for people who have PCs with 1GB or 2GB of RAM, the PROCESS
STILL GETS 4GB.

Where is this "GHOST RAM" coming from?


--
HLS

Joseph M. Newcomer

unread,
Mar 22, 2010, 9:00:18 PM3/22/10
to
see below...

On Mon, 22 Mar 2010 12:21:52 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Hector Santos" <sant...@nospam.gmail.com> wrote in message

>news:%23o9WRud...@TK2MSFTNGP06.phx.gbl...


>> Peter Olcott wrote:
>>
>>
>>>> If you can see that in the code, then quite honestly,
>>>> you don't know how to program or understand the concept
>>>> of programming.
>>
>>

>>> I am telling you the truth, I am almost compulsive about
>>> telling the truth. When the conclusions are final I will
>>> post a link here.
>>
>>
>> What GROUP is this? No one will trust your SUMMARY unless
>> you cite the group. Until you do so, you're lying and
>> making things up.
>>
>> I repeat: If you can't see the code I posted proves your
>> thinking is incorrect, you don't know what you are talking
>> about, and it's becoming obvious now you don't have any kind
>> of programming or even engineering skills.
>>
>> --
>> HLS
>
>I did not examine the code because I did not want to spend
>time looking at something that is not representative of my
>process. Look at the criteria in my other post, and if you
>agree that it meets those criteria, then I will look at your
>code.
>
>You keep bringing up memory mapped files. Although this may
>very well be a very good way to use disk as RAM, or to load
>RAM from disk, I do not see any possible reasoning that
>could ever possibly show that a hybrid combination of disk
>and RAM could ever exceed the speed of pure RAM alone.
***
Which proves you have no idea what we are talking about. There is no difference between
using disk as RAM to load large chunks of data and using it to load, say, executable code
(which is how .exe files and .dll files are managed), or loading the data in via ReadFile
(except for the minor detail that the segment can be shared). Because the pages are
brought in via the paging mechanism, they are just pages, and if (as you have
demonstrated) these pages remain not-paged-out, they will be there when you go to use them
again, in every process that shares that segment. [I'm giving away some of the answers I
wanted you to research on your own.]
***
>
>If you can then please show me the reasoning that supports
>this. Reasoning is the ONLY source of truth that I trust,
****
Gee, this is how we got the notion that heavier objects fall faster than lighter objects,
that all knowledge was known to the Greeks (such as Aristotle), that the Sun circles the
Earth, that the Earth is flat, that the Heavenly Sphere is perfect (so mountains on the
moon, or moons around Jupiter or Saturn, are not possible, nor can there be "new stars"),
that planetary orbits are circular, and other fundamental truths. It was OBSERVATION that
disproved these ideas. (And Galileo's proof was not done by dropping cannonballs from
the Leaning Tower of Pisa, but by rolling objects down inclined planes.)

I tend to trust measurable facts over pretty theories. The ONLY truth is the truth that
you can measure and put a number on.

After the observations, a few pioneers like Newton, Kepler, etc. came along and came up
with theories. And even the best theory doesn't solve the 3-body problem with a
closed-form solution.
****
>all other sources of truth are subject to errors. Reasoning
>is also subject to errors, but, these errors can be readily
>discerned as breaking one or more of the rules of correct
>reasoning.
****
Note that most theories involve what are called "closed-form mathematical solutions"; that
is, you get a formula and you plug some numbers in, and lo! out comes an answer. Then
someone comes along and shows that what you have is NOT a closed-form solution, but might
even be a closed-form solution only under specified initial conditions, which cannot be
guaranteed to exist under normal operating conditions. Why do you think we build
supercolliders like the Large Hadron Collider, or do astronomy? Or have people worried
about black hole event horizons? Because the only source of truth is to observe the
Universe and what it does. Nice reasoning does NOT predict the photoelectric effect, but
if you take as a premise something different, like quantum physics, the photoelectric
effect follows naturally. Or look at quantum electrodynamics and how it explains optical
effects such as reflection, lenses, prisms, and interference fringes with a SINGLE theory.
The problem is that we have no closed-form solutions for multilevel cache behavior, nor is
there a simple theory of operation of "highly pipelined, highly concurrent,
asynchronous-execution, superscalar with multilevel cache" architecture performance, so
you don't have a way to reason about them. You can guess, but your guess may not be
correct.

So I said "run the measurements" and you say "I don't have to, I know the planetary orbits
are circular, because simple reason tells me This Must Be So." Or the modern equivalent.
And I'm saying "run the experiment, see if the data agrees with the theory".
joe
****

Joseph M. Newcomer

unread,
Mar 22, 2010, 9:05:42 PM3/22/10
to
See below...

***
Wrong. For reasons I have tried to explain any number of times! Do you even know what
the term "virtual memory" means?

You CANNOT avoid virtual memory usage in a Windows, Unix, or Linux app. It is the ONLY
REALITY that exists at application level. In Windows, it is the only reality that exists,
period. Even the kernel runs entirely in virtual memory!
****


>
>** Do virtual memory maps apply to anything else besides
>virtual memory?

***
Why would they need to? That's what virtual memory is all about!
****


>
>>
>> I repeat my assertion: you are clueless. Anyone who
>> tries to use a generic wikipedia
>> article as proof of being not clueless is metaclueless.
>> joe
>> ***

>>>

Hector Santos

unread,
Mar 22, 2010, 9:02:06 PM3/22/10
to

Peter Olcott wrote:

>>> Which is it (1) or (2) ??? Any hem hawing will be
>>> taken as intentional deceit
>>
>> None of the above:
>>
>> Your process AT THAT MOMENT does not need PAGE

>> anything because it was already in your WORKING SET.

>>
>> Look, YOU SIMPLE PROGRAM IS ALWAYS USING VIRTUALIZE
>> MEMORY! ALWAYS!
>
> You just contradicted yourself, but, there was no
> intentional deceit.
> (1) It is ALWAYS using virtual memory
> (2) There was an instance where is was not using virtual
> memory


I don't see that as a contradiction at all. Your process gets 4GB of
Virtual Memory. There was a moment in space and time when your
program did not need the OS to page data into your WORKING SET, which
is the virtual memory in active use by your program.

For example, say your program has allocated 2GB. It may look like your
program has access to 2GB, and that's the overall idea, but it's virtualized.
When you reference something in the 2GB that is not in the WORKING
SET, then you get a PAGE FAULT, which tells the system to go get the
data you need from the pagefile (pagefile.sys).

Look it says it right here in MSDN:

http://support.microsoft.com/kb/555223

In modern operating systems, including Windows, application
programs and many system processes *ALWAYS* reference memory using
virtual memory addresses which are automatically translated to real
(RAM) addresses by the hardware. Only core parts of the operating
system kernel bypass this address translation and use real memory
addresses directly.

Virtual Memory is always in use, *EVEN* when the memory required
by all running processes does not exceed the amount of RAM
installed on the system.

REPEAT THE FIRST SENTENCE IN EACH PARAGRAPH 1000 TIMES!

How ignorant can you be? When did you start using Windows? Or
programming for it?

--
HLS

Hector Santos

unread,
Mar 22, 2010, 9:04:58 PM3/22/10
to
Peter Olcott wrote:

> And of course you know that a second thread would work just
> fine because you know that my process is not memory
> bandwidth intensive.


Yes, we know that. The simulator, real code with shared memory and
multiple threads, proved this, and if you take the time to explore it,
you will see for yourself.

--
HLS

Joseph M. Newcomer

unread,
Mar 22, 2010, 9:17:47 PM3/22/10
to
See below...

On Mon, 22 Mar 2010 16:59:34 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Hector Santos" <sant...@nospam.gmail.com> wrote in message
>news:%23F2oLmg...@TK2MSFTNGP06.phx.gbl...
>> Peter Olcott wrote:
>>
>>> Joe kept insisting and continues to insist that my data
>>> is not resident in memory.
>>
>>
>> If you have a 32 bit Windows OS, you are limited to just
>> 2GB RAW ACCESS and 4GB of VIRTUAL MEMORY.
>
>Yes, and that is another thing. I kept saying that I have a
>64bit OS, and Joe kept forming his replies in terms of a
>32-bit OS.

****
And how long did I keep saying "Unless you are running a Win32 process in Win64"? But you
did not clarify that you were running on Win64. So in the absence of any explicit
statement I had to assume you were running in Win32.
****


>
>>
>> If your process is loading 4GB, you are using virtual
>> memory.
>>
>>> After loading my data and waiting twelve hours the
>>> process monitor reports zero page faults, when I execute
>>> my process and run it to completion.
>>
>>
>> You're lying, you told me you have PAGE FAULTS but they
>> settle down to zero, which is NORMAL. But start a 2nd
>> process and you will get page faults.
>
>I only get the page faults until the data is loaded. After
>the data is loaded I get essentially no more page faults,
>even after waiting twelve hours before running my process to
>completion. After proving that my data is resident in RAM
>Joe continues to chide me for claiming that my data is
>resident in RAM.

****
If you used a memory-mapped file correctly, you would have very low-cost page faults
because you would be mapping to existing pages. But you seem to not want to hear that
memory-mapped files will improve performance, particularly in a multiple-process
environment.
joe
****
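Joe's point about mapping to existing pages can be sketched concretely. The thread concerns Win32 (CreateFileMapping/MapViewOfFile); the sketch below uses the POSIX equivalent, mmap, purely for illustration, and the function name and path are made up:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a file as shared memory. Every process that maps the same file
// shares the same physical pages, so a second process's page faults
// are cheap "soft" faults that merely wire already-resident pages
// into its address space instead of hitting the disk.
char* MapSharedData(const char* path, std::size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, (off_t)size) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);  // the established mapping keeps the pages alive
    return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}
```

On Windows the same effect comes from CreateFileMapping plus MapViewOfFile in each process.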


>
>You guys just playing head games with me?

****
We are trying to help you, in spite of your best efforts to tell us we are wrong. You
insist that simplistic experiments which gave you a single data point give you a basis for
extrapolating an entire family of performance information, and we are saying "You don't
KNOW until you've MEASURED" and you insist that measurement is not relevant because you
MUST be right. All I'm saying is that you MIGHT be right, and once you do the
measurements, you might find out that you are completely WRONG, which works to your
advantage. So run the damn experiment, already!
joe

****

Peter Olcott

unread,
Mar 22, 2010, 9:19:32 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:db2gq5tuuu12sbg67...@4ax.com...

I always boil everything down to its bare essence, and
remove any extraneous details that do not specifically,
directly, and completely pertain to the precise point at
hand. I use categorical thinking, not item-by-item,
detail-by-detail analysis, unless those details can be shown
to be 100% relevant to the exact precise point at hand.

You seem to come at things from the opposite point of view,
carefully examining every little nuance of a detail just in
case it might possibly be at least slightly relevant. There
are cases where caching cannot improve performance, so I try
to see if we can categorically eliminate the need to look at
this before I proceed a micro step towards considering any
of its details.

How could I very quickly measure exactly how much of the
total memory bandwidth my process takes?

I am pretty sure that it takes almost all of it, thus proving
at least one of my points without the need for further
investigation on this point. It would prove that adding
another thread can't possibly help. How do I quickly and
accurately measure my process's memory bandwidth usage?

All of this will soon be moot anyway because my updated
process will have substantially different memory access
requirements.
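One rough way to answer Peter's question is a minimal, portable C++ sketch that times a copy of a buffer much larger than any L3 cache. The function name and the 2x read-plus-write accounting are my own assumptions; a real measurement would pin the thread and repeat the run:

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Estimate single-thread memory bandwidth in GB/s by timing one large
// memcpy. The buffers must dwarf the L3 cache so that RAM, not cache,
// is measured; each byte is read once and written once, hence the 2x.
double EstimateBandwidthGBs(std::size_t bytes)
{
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), bytes);
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return (2.0 * bytes) / secs / 1e9;
}
```

Comparing this ceiling against what one thread of the real process actually consumes shows how much bandwidth headroom remains for a second thread.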

Joseph M. Newcomer

unread,
Mar 22, 2010, 9:22:56 PM3/22/10
to
On Mon, 22 Mar 2010 16:44:49 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newc...@flounder.com> wrote in

>message news:ifofq5hdvoesvnv8a...@4ax.com...
>> see below...
>> On Mon, 22 Mar 2010 12:15:27 -0500, "Peter Olcott"

>> <NoS...@OCR4Screen.com> wrote:
>>
>>>
>>>"Hector Santos" <sant...@nospam.gmail.com> wrote in
>>>message

>>>news:uGVpjmdy...@TK2MSFTNGP06.phx.gbl...
>>>> Peter Olcott wrote:
>>>>
>>>>> Since my process (currently) requires unpredictable
>>>>> access to far more memory than can fit into the largest
>>>>> cache, I see no possible way that adding 1000-fold
>>>>> slower
>>>>> disk access could possibly speed things up. This seems
>>>>> absurd to me.
>>>>
>>>>
>>>> And I would agree it would seem to be absurd to
>>>> inexperienced people.
>>>>
>>>> But you need to TRUST the power of your multi-processor
>>>> computer because YOU are most definitely under utilizing
>>>> it by a long shot.
>>>>
>>>> The code I posted is the proof!
>>>
>>>If it requires essentially nothing besides random access
>>>to entirely different places in 100 MB of memory, then
>>>(then and only then) would it be reasonably representative
>>>of my process. Nearly all my process does is look up in
>>>memory the next place to look up in memory.
>> ***
>> Then the CORRECT approach is not to say "I don't want to
>> waste time reading your example
>> because it doesn't match my problem", the CORRECT approach
>> is to say "I have now modified
>> your example to more closely resemble my access patterns,
>> ran it, and got THIS result" and
>> show the results you got.
>> joe


>
>I think that you must have paraphrased me incorrectly and I
>don't see the context in this post. I would not likely have
>concluded that it does not match my problem, and in fact
>remember asking Hector if it did match my problem. I stated
>my problem much more concisely than he stated his proof.

****
So I misremembered

>I did not examine the code because I did not want to spend
>time looking at something that is not representative of my
>process.

?
*****

>
>The single sentence is above. Hector never told me whether
>or not he thought that it matched my algorithm's memory
>usage pattern.
>
>>
>>>
>>>>

>>>> Your issue is akin to having a pickup truck, overloading
>>>> the back, piling things on each other, overweight beyond
>>>> the recommended safety levels per specifications of the
>>>> car manufacturer (and city/state ordinances), and now
>>>> your
>>>> driving, speed, vision of your truck are all altered.
>>>> Your truck won't go as fast now and if even if you
>>>> could,
>>>> things can fall, people can die, crashes can happen.
>>>>
>>>> You have two choices:
>>>>
>>>> - You can stop and unload stuff and come back and
>>>> pick
>>>> it up on
>>>> a 2nd trip, your total travel time doubled.
>>>>
>>>> - you can get a 2nd pick up truck, split the load and
>>>> get
>>>> on a four lanes highway and drive side by side,
>>>> sometimes
>>>> one creeps ahead, and the other moves ahead, and
>>>> both reach
>>>> the destination at the near same expected time.
>>>>
>>>> Same thing!
>>>>
>>>> You are overloading your machine to the point it is working
>>>> very, very hard to satisfy your single-thread process
>>>> needs. You may "believe" it is working at optimal
>>>> speeds
>>>> because it has uninterrupted exclusive access but it is
>>>> not reality. You are under utilizing the power of your
>>>> machine.
>>>>
>>>> Whether you realize it or not, the overloaded pickup
>>>> truck
>>>> is smart and is stopping you every X milliseconds
>>>> checking
>>>> if you have a 2nd pickup truck to offload some work and
>>>> do
>>>> some moving for you!!
>>>>
>>>> You need to change your thinking.
>>>>
>>>> However, at this point, I don't think you have any
>>>> coding
>>>> skills, because if you did, you would be EAGERLY JUMPING
>>>> at the code I provided to see for yourself.
>>>>
>>>> --
>>>> HLS
>>>

Peter Olcott

unread,
Mar 22, 2010, 9:28:40 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:OTB08Hiy...@TK2MSFTNGP06.phx.gbl...

> Peter Olcott wrote:
>
>>> It says nothing about one execution; it says that under
>>> certain conditions, paging is not
>>> an issue. It does not say anything about using multiple
>>> threads on multiple cores within
>>> a single process.
>>
>> This is the earlier issue where you claimed that my
>> thinking that I needed to have my data resident in RAM
>> was absurd and based on ignorance. I do need to have my
>> data resident in RAM and indeed my data is resident in
>> RAM for extended periods, and there is no ignorance
>> associated with this thinking.
>
>
> But it is ignorance, because your PROCESS MEMORY is VIRTUAL
> MEMORY!

Virtual memory is essentially disk pretending to be RAM;
when disk is not used (no page faults), it is no longer
disk pretending to be RAM. Even though some of the VM
infrastructure remains in place and still operates
(requiring a tiny bit of overhead), the part that most
significantly impacts performance is not functioning. Thus,
from a performance point of view, VM is essentially not
functioning.

If you want to get nit-picky and refrain from boiling things
down to their essence, you can say that VM is still
operating. For all practical purposes, from a pure
performance point of view, VM is impacting performance
negligibly, and thus can be construed as if it were not
functioning. That is one example of the extraneous nit-picky
details that always boiling everything down to its bare
essence strips from further consideration.

Joseph M. Newcomer

unread,
Mar 22, 2010, 9:29:30 PM3/22/10
to

On Mon, 22 Mar 2010 06:28:36 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Hector Santos" <sant...@nospam.gmail.com> wrote in message

****
Read the data! Don't you see the obvious? If necessary, bring up Excel and plot the data in
a graph. It should jump out at you. I fail to see how it challenges your analytic
abilities to infer what is going on here.
****


>Also by using a memory mapped file your process would have
>entirely different behavior than mine.

****
Really? You KNOW THIS? I see nothing that would suggest this conclusion follows.
****


>
>I known that it is possible that you could have been right
>all along about this, and I could be wrong. I know this
>because of a term that I coined. [Ignorance Squared].

****
So why did you insist on telling us that our urging to run the experiment was wrong? Why
do you insist on an unsubstantiated conclusion in the absence of real data?
****


>
>[Ignorance Squared] is the process by which a lack of
>understanding is perceived by the one who lacks this
>understanding as disagreement. Whereas the one who has
>understanding knows that the ignorant person is lacking
>understanding, the ignorant person lacks this insight, and is
>thus ignorant even of their own ignorance, hence the term
>[Ignorance Squared].
>

>Now that I have a way to empirically validate your theories
>against mine (that I dreamed up last night while sleeping) I
>will do this.
>

Peter Olcott

unread,
Mar 22, 2010, 9:35:58 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:443gq51rqdi6oio92...@4ax.com...

We are back to my categorical reasoning again, boiling this
down to their most bare essence.

If my process is always resident in actual RAM, then
everything about virtual memory and optimizing the way that
virtual memory works becomes entirely moot. I don't want to
spend one microsecond on this until it is proven beyond all
possible doubt that my process cannot always be resident in
actual RAM. I could easily get caught up in enough details
that the rest of my life is not nearly enough time to get
started in business.

Peter Olcott

unread,
Mar 22, 2010, 9:39:47 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:OXYJ1Ri...@TK2MSFTNGP05.phx.gbl...

>
> Peter Olcott wrote:
>
>>>> Which is it (1) or (2) ??? Any hem hawing will be
>>>> taken as intentional deceit
>>>
>>> None of the above:
>>>
>>> Your process AT THAT MOMENT does not need PAGE
>
> >> anything because it was already in your WORKING SET.
>
>>>
>>> Look, YOU SIMPLE PROGRAM IS ALWAYS USING VIRTUALIZE
>>> MEMORY! ALWAYS!
>>
>> You just contradicted yourself, but, there was no
>> intentional deceit.
>> (1) It is ALWAYS using virtual memory
>> (2) There was an instance where is was not using virtual
>> memory
>
>
> I don't see that as a contradiction at all. Your process
> gets 4GB Virtual Memory. There was a moment in space and
> time that your program did not need the OS to page data
> into your WORKING SET which is the virtual memory in
> active use by your program.

It was not a moment in time, it was a 12-hour time period.
The only reason that it ended was that I was convinced that
twelve hours was enough.

For all practical purposes virtual memory is not being used
(meaning that its use is not impacting performance) whenever
zero or very few page faults are occurring.
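Peter's "zero page faults" claim can be checked from inside the process rather than from a monitor. On Windows the counter is PageFaultCount in the PROCESS_MEMORY_COUNTERS structure filled in by GetProcessMemoryInfo; the sketch below uses the POSIX getrusage equivalent, and the function name is my own:

```cpp
#include <sys/resource.h>

// Number of major (disk-servicing) page faults this process has taken
// so far. If the data set is truly resident, this count stops growing
// once loading finishes, no matter how long the process then runs.
long MajorFaultsSoFar()
{
    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_majflt;
}
```

Sampling this before and after a run answers the residency question with a number instead of a twelve-hour wait.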


Peter Olcott

unread,
Mar 22, 2010, 9:58:11 PM3/22/10
to

"Hector Santos" <sant...@nospam.gmail.com> wrote in message
news:u0ggbTi...@TK2MSFTNGP05.phx.gbl...

void Process()
{
    KIND num;
    for (int r = 0; r < repeat; r++)
        for (DWORD i = 0; i < size; i++)  // DWORD: a 16-bit WORD index wraps on large arrays
            num = data[i];
}

Not at all representative of my process, thus it proves
nothing about my process. Your process could derive pure
spatial locality of reference whereas mine would not. I do
not move to the next sequential memory location; my memory
access (from the cache point of view) is nearly purely
random. If you had a list of 10,000 memory locations that
are all very far from each other, then your process would
approximate mine.
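The access pattern Peter describes, where each lookup yields the next address to visit, is classic pointer chasing, and it is easy to simulate directly rather than argue about. A sketch under my own naming: build one random cycle over N slots so every load depends on the previous one, defeating both the prefetcher and spatial locality:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Build a single random cycle: next[i] holds the index to visit after
// slot i. Following it touches every slot exactly once in an order
// the hardware prefetcher cannot predict.
std::vector<std::size_t> MakeCycle(std::size_t n, unsigned seed)
{
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin() + 1, order.end(), std::mt19937(seed));
    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i < n; ++i)
        next[order[i]] = order[(i + 1) % n];
    return next;
}

// Chase the chain from slot 0 back around to slot 0; each iteration's
// load address depends on the previous load, serializing the misses.
std::size_t ChaseAll(const std::vector<std::size_t>& next)
{
    std::size_t pos = 0, visited = 0;
    do {
        pos = next[pos];
        ++visited;
    } while (pos != 0);
    return visited;  // equals next.size() for a single cycle
}
```

Timing ChaseAll over arrays of increasing size separates cache-resident from RAM-resident behavior far better than a sequential sweep does.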

You might also look at the generated code; the optimizer
tends to eliminate code such as your test case. You also
can't simply disable the optimizer, because that could skew
the test from memory intensive to CPU intensive. Maybe you
could proceed through the list and swap the value of the
current item with the value of the preceding item, and loop
through again and again.
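Peter's swap suggestion does keep the optimizer honest, because the array's final contents depend on every read and write. A minimal sketch of that idea (my naming; as written, one pass rotates the array left by one element):

```cpp
#include <cstddef>
#include <vector>

// Swap each element with its predecessor. Every pass reads and writes
// the whole array, and the result is observable afterwards, so the
// compiler cannot delete the loop the way it can delete `num = data[i];`.
void SwapPass(std::vector<unsigned>& data)
{
    for (std::size_t i = 1; i < data.size(); ++i) {
        unsigned tmp = data[i - 1];
        data[i - 1] = data[i];
        data[i] = tmp;
    }
}
```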


Hector Santos

unread,
Mar 22, 2010, 9:58:07 PM3/22/10
to
Peter Olcott wrote:

>> I don't see that as a contradiction at all. Your process
>> gets 4GB Virtual Memory. There was a moment in space and
>> time that your program did not need the OS to page data
>> into your WORKING SET which is the virtual memory in
>> active use by your program.
>
> It was not a moment in time it was a 12 hour time period.
> The only reason that It ended was that I was convinced that
> twelve hours was enough.


You know, you really need to stop this horse stuff of yours. It's
right there, straight from the horse's mouth. Why did you choose to
ignore this? I'll show it again:

http://support.microsoft.com/kb/555223

In modern operating systems, including Windows, application
programs and many system processes *ALWAYS* reference memory using
virtual memory addresses which are automatically translated to real
(RAM) addresses by the hardware. Only core parts of the operating
system kernel bypass this address translation and use real memory
addresses directly.

Virtual Memory is always in use, *EVEN* when the memory required
by all running processes does not exceed the amount of RAM
installed on the system.

WHY ARE YOU IGNORING THIS?

You got 8GB on your box and your process is wanting 4GB - it is STILL
virtualized no matter what you do or say.

The whole point of this thread is that you SAID you cannot run a 2nd
process because it kills your system.

Well of course, because now you need 8GB for 2 processes!

If you don't single-source the data as sharable memory, then YOU
WILL NEVER be able to run more than 1 process on your machine.

That's not because of the physical limitations of the machine. It's
because your PROGRAM is FLAWED and NOT designed for any kind of
scalability or usage beyond a single-process application - n'est-ce pas!

I illustrated and proved to you with posted code how to optimize the
process so that YOU can scale and leverage the power of your machine.

Right now you are under-utilizing it and using it incorrectly! You
really wasted your money, and if you think the solution is to scale
out, well, that's your problem, because you are dumping money down
the drain for nothing.

--
HLS

Peter Olcott

unread,
Mar 22, 2010, 10:07:35 PM3/22/10
to

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:l45gq55hlc3sn35e2...@4ax.com...

I don't want to hear about memory mapped files because I
don't want to hear about optimizing virtual memory usage
because I don't want to hear about virtual memory until it
is proven beyond all possible doubt that my process does not
(and can not be made to be) resident in actual RAM all the
time.

Since a test showed that my process did remain in actual RAM
for at least twelve hours, this is sufficient evidence to
show that all of these lines of reasoning have, at least for
the moment, become completely moot. The only thing that could
make them less than completely moot would be proof that my
process cannot remain resident in RAM all the time.

Hector Santos

unread,
Mar 22, 2010, 10:17:55 PM3/22/10
to
Peter Olcott wrote:

>>
>>> And of course you know that a second thread would work
>>> just fine because you know that my process is not memory
>>> bandwidth intensive.
>>
>> yes, we know that. The simulator, real code with shared
>> memory and multiple threads, proved this and if you took
>> the time to explore it, you will see for yourself.
>>
>

> void Process()
> {
>     KIND num;
>     for (int r = 0; r < repeat; r++)
>         for (DWORD i = 0; i < size; i++)
>             num = data[i];
> }
>
> Not at all representative of my process, thus proves nothing
> about my process.


This is the MAXIMUM MEMORY ACCESS you can ever reach. Your
application's memory access will be less stressful.

> Your process could derive pure spatial
> locality of reference whereas mine would not.


and I followed up with a RANDOM access memory access:

void Process()
{
    KIND num;
    for (int r = 0; r < repeat; r++)
        for (DWORD i = 0; i < size; i++) {
            // note: rand() spans only 0..RAND_MAX (32767 in MSVC),
            // so very large arrays need a wider PRNG
            DWORD j = (DWORD)(rand() % size);
            num = data[j];
        }
}

and provided all the results on that to SHOW that randomness, which is
closer to your unpredictable-memory-access theory, produced better
results. I even gave you some tips on using Pareto's principle,
because I don't believe YOUR application is as unpredictable as YOU
seem to think it is.

> I do not move

> to the next sequential memory location, my memory access
> (from the cache point of view) is nearly purely random.


See above. Again, the serialized access simulation represents the
worst-case scenario, and it contradicts your theory that there is a
major bottleneck from memory access contention with multiple threads.

> If you had a list of 10,000 memory locations that are all very
> far from each other, then your process would approximate
> mine.


The simulator had a MAXULONG/6-item array of DWORDs (4 bytes each),
~1.4GB on a 2GB machine, which is 75% of memory capacity; you only
have a 50% memory need - FOR 1 PROCESS. So this simulator is a FAR
worse case than yours for MEMORY ACCESS.

> You might also look at the generated code, the optimizer
> tends to eliminate code such as your test case.


Not the case here, and EVEN THEN, there are still 10 loops over
MAXULONG/6 items accessed.

The FACT is, it is being read, because the MEMORY LOAD and the
working set increase.

The bottom line: the code shows your process is scalable when coded
properly to leverage the technology in the Windows OS with
multi-core hardware.
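Hector's scalability claim is directly testable: give each thread an independent slice of a shared read-only array and compare wall time for 1 vs N workers. A minimal sketch (the names and the strided split are my own choices, not Hector's posted simulator):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sum every `stride`-th element starting at `start`; reading a shared
// array needs no locks, so workers contend only for memory bandwidth.
unsigned long long SumStrided(const std::vector<unsigned>& data,
                              std::size_t start, std::size_t stride)
{
    unsigned long long sum = 0;
    for (std::size_t i = start; i < data.size(); i += stride)
        sum += data[i];
    return sum;
}

// Split the array across nthreads workers. If wall time with N workers
// is close to the 1-worker time divided by N, then CPU, not memory
// bandwidth, was the bottleneck.
void RunWorkers(const std::vector<unsigned>& data, unsigned nthreads,
                std::vector<unsigned long long>& sums)
{
    sums.assign(nthreads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&data, &sums, t, nthreads] {
            sums[t] = SumStrided(data, t, nthreads);
        });
    for (std::thread& th : pool) th.join();
}
```

Timing this with a buffer large enough to miss cache settles the bandwidth question empirically, which is exactly what both Joe and Hector are asking for.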

Your design presumption that it is memory bound for multi-threaded
processing was incorrect.

--
HLS
