
Photoshop is NOT SMP!


davexl

Feb 17, 1999
I am sick to death of this misinformation.

Photoshop is NOT a good example of Symmetric Multi-Processing. In fact
hardly any of it IS SMP. See table below.

I ran the infamous Photoshop 5 action file on a dual Celeron 450 in
one and two CPU configuration. To summarise: SMP is implemented so
badly that the difference on this test was 7%. Yes, seven whole
percent. Only five functions in this test were multithreaded. This is
not a biased example of choosing functions either - notice that the
few that are SMP are counted multiple times.

NT is not the problem, other applications like Digital Fusion doing
exactly the same types of calculations can get scaling of 199% on the
same hardware - on EVERY function. Photoshop's abysmal performance is
due to piss poor code by a very lazy, sorry: 'efficient' software
company that won't write modern code until the lowest common
denominator platform can handle it.

Compare NT native apps like Digital Fusion with Adobe equivalents like
After Effects and you will see what I mean. I am dying for OS-X to be
released so Macs and Mac first software companies like Adobe will
'get' the concept of SMP.

I have no love for MS, but I won't blame them for Adobe's mistakes.
Linux and Mac advocates would have you believe that NT or P-IIs are
the problem. NT scales well up to 4 CPUs, and so do P-IIs and Xeons.
It is the applications that determine how well SMP works.
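
To make "write modern code" concrete, here is a toy sketch (my
illustration only, not anybody's shipping code) of the standard trick:
chop the image into bands of scanlines and hand one band to a thread
per CPU. POSIX threads are shown; the same structure works on NT with
CreateThread.

/* Toy sketch: splitting a per-pixel filter across two threads so an
 * SMP box can use both CPUs. The invert filter is a stand-in for any
 * operation whose output pixels are independent. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define WIDTH  4096
#define HEIGHT 4096

typedef struct {
    unsigned char *pixels;   /* shared image buffer */
    int row_start, row_end;  /* half-open range of scanlines */
} Slice;

/* Each thread filters its own band; no locking is needed because
 * the bands do not overlap. */
static void *filter_slice(void *arg)
{
    Slice *s = (Slice *)arg;
    for (int y = s->row_start; y < s->row_end; y++)
        for (int x = 0; x < WIDTH; x++)
            s->pixels[y * WIDTH + x] = 255 - s->pixels[y * WIDTH + x];
    return NULL;
}

int main(void)
{
    unsigned char *img = calloc((size_t)WIDTH * HEIGHT, 1);
    if (!img) return 1;

    pthread_t t1, t2;
    Slice top = { img, 0, HEIGHT / 2 };
    Slice bot = { img, HEIGHT / 2, HEIGHT };

    pthread_create(&t1, NULL, filter_slice, &top);
    pthread_create(&t2, NULL, filter_slice, &bot);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("filtered %dx%d image on two threads\n", WIDTH, HEIGHT);
    free(img);
    return 0;
}

Most filters whose output pixels are independent can, in principle, be
split exactly like this.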

If you compare these results to a G3/400 at
http://www.xlr8yourmac.com/tips/photoshop5perf.html you see that a G3
400 scoring 148 'beats' a dual P-II 450 scoring 177.

So, a G3 400 is a bit faster than a P-II 450.

If what you are running is SMP, a dual P-II will be very much faster
than a single G3. If what you are running is not SMP, even a quad
Xeon 450 will be beaten by a G3.

As always, IT DEPENDS.

*rant over*


Dual 450MHz Celeron. Times in seconds, lower is better.

Photoshop 5 Function     Dual   Single  Threading

Rotate 90                 0.5     0.5     100%
Rotate 9                  2.0     3.0     150%
Rotate .9                 1.8     2.8     156%
Gaussian Blur 1           1.2     1.2     100%
Gaussian Blur 3.7         1.7     2.9     171%
Gaussian Blur 85          3.4     5.5     162%
Unsharp 50/1/0            1.3     1.4     108%
Unsharp 50/3.7/0          1.8     3.1     172%
Unsharp 50/10/5           2.0     3.3     165%
Despeckle                 2.1     3.7     176%
RGB-CMYK                  5.9     5.9     100%
Reduce Size 60%           0.7     1.2     171%
Lens Flare                5.1     4.9      96%
Color Halftone           11.4    11.2      98%
NTSC Colors               4.3     4.3     100%
Accented Edges           13.2    13.2     100%
Pointillize              18.8    18.7      99%
Water Color              27.3    27.2     100%
Polar Coordinates        12.7    14.0     110%
Radial Blur              43.0    42.9     100%
Lighting Effects         10.8    11.9     110%

PS4Bench Index          171.0   182.8     107%

Average threading = 126%
Average threading efficiency where implemented = 165%
Percent of functions threaded = 8 of 21 = 38%
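
A quick sanity check (straight Amdahl's-law arithmetic from the table
above):

  threaded single-CPU time   = 3.0+2.8+2.9+5.5+3.1+3.3+3.7+1.2 = 25.5 s
  threaded dual time at 165% = 25.5 / 1.65                     = 15.5 s
  unthreaded time, unchanged = 182.8 - 25.5                    = 157.3 s
  predicted dual total       = 157.3 + 15.5                    = 172.8 s
  predicted overall gain     = 182.8 / 172.8                   = 106%

which lands right on the measured 107%: the unthreaded 86% of the test
swamps the 165% scaling of the threaded part.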

http://www.geocities.com/Paris/Cafe/4363/

And before you get your knickers in a twist, our dual P-II 300s
perform in similar ratios - the cache differences in the Celeron do
not affect the SMP efficiency.

Christian Bau

Feb 17, 1999

davexl wrote:
>
> I am sick to death of this misinformation.
>
> Photoshop is NOT a good example of Symmetric Multi-Processing. In fact
> hardly any of it IS SMP. See table below.

In the context of this newsgroup, there is the continuing debate between
Macintosh users saying "CMT works in practice" and trolls saying "only
PMT works". The debate is between "works in practice" and "magic
bullet".

So someone posted that in the case of Photoshop and SMP, the "magic
bullet" approach doesnt work. The argument up to this point is like
this:

Troll: SMP works by magic.
Mac user: No it doesn't.
Troll: Yes it does.
Someone posted: I tried Photoshop on Windows NT on a dual processor
machine, and the SMP magic bullet didn't work.
davexl: I am sick to death of this misinformation. The fault is with
Photoshop, not with the magic bullet.

So the magic bullet SMP only works if the application writer
cooperates??? Just like multiprocessing on the Macintosh works if the
application writer cooperates???

Steve

Feb 17, 1999

And the Mac advocates continue to promote the idea that Photoshop is
the be all and end all of all applications. If something works in
Photoshop, they believe it works for all applications, and if it
doesn't work for Photoshop, they believe it doesn't work on all other
applications. One instance of an application where SMP works does not
mean that it works well in all applications any more than one
application where it does not work means that it does not work in all
applications. The advantage for a dual processor system running
Windows NT4 ranges from a slight disadvantage to nearly 100% over a
single processor system. The advantage is also dependent on the
amount of memory in the system, with more memory being a bigger
advantage with dual-processor systems.

See the following for data on single application performance with SMP:

http://www.anandtech.com/html/review_display.cfm?document=483
http://www.newmedia.com/NewMedia/98/09/labreport/charts.html
breakthrough.pdf (downloaded from Intel's web site)

and of course to inject the ever popular ;-) SPEC into the argument,
SPECint_rate95/fp_rate95 (designed to benchmark multi-processor
systems) shows a 96%/69% advantage for a 2-processor Dell PW 410
running NT4 over a single processor system. Thus, NT4 WS is certainly
capable of scaling well to 2 processors; it just depends on whether
multi-tasking is taking place or the applications are well-designed
for SMP. For a comparison (among many possible), an IBM RISC
System/6000 43P-260 running AIX showed a performance advantage of
100%/91% for a 2-processor system (showing that although NT scales
reasonably well to 2 processors, it's not quite as good as Unix).

http://open.specbench.org/osg/cpu95/results/rint95.html
http://open.specbench.org/osg/cpu95/results/rfp95.html

Furthermore, Mac advocates are so trained to discount the advantages
of PMT that they refuse to address the issue of the advantage of SMP
systems when you are multi-tasking. I have addressed this numerous
times in my posts, but Mac advocates have avoided the issue like the
plague. When you are multi-tasking CPU-intensive applications, you
can get a near doubling in performance. For instance, a dual 266 MHz
PII running Business Winstone 99 and MP3 decompression at the same
time was nearly twice as fast as a single processor system
(http://206.132.42.114/html/review_display.cfm?document=863). The
advantage decreased to less than 25% on a dual 400 MHz PII system,
probably because the 2 applications were not capable of saturating
two 400 MHz processors.

The bottom line is that most people who do not multi-task and do not
use high-end technical apps will not see much advantage from SMP
systems. However, those people who use well designed SMP-aware
applications will see a large boost, and those who multi-task
CPU-intensive applications will see large boosts no matter whether or
not the individual applications are SMP-aware.


-Steve

**********
"The problem is I have a life.
I just can't be the CEO at Apple.
I just don't have that to give."
-Steve Jobs, July 1997
**********

Eric Bennett

Feb 17, 1999
In article <36da9b8c...@10.1.1.9>, dav...@ozramp.net.au.despam-me!
(davexl) wrote:


> NT is not the problem, other applications like Digital Fusion doing
> exactly the same types of calculations can get scaling of 199% on the
> same hardware - on EVERY function. Photoshop's abysmal performance is
> due to piss poor code by a very lazy, sorry: 'efficient' software
> company that won't write modern code until the lowest common
> denominator platform can handle it.

Ha! Now you know what it feels like to get software that's poorly ported
from another OS! ;-)

--
Eric Bennett (www.pobox.com/~ericb), Cornell University


Bill Frisbee

Feb 17, 1999

Eric Bennett <er...@pobox.com> wrote in message
news:ericb-17029...@x3066.resnet.cornell.edu...


Ohhh... good zinger!


Bill F.


Gary

Feb 17, 1999
davexl wrote:

>
>
> I ran the infamous Photoshop 5 action file on a dual Celeron 450 in
> one and two CPU configuration.

ohhh, the Celeron 450 chip... you mean that horrible Intel chip that
has no Level 2 cache? (..sort of reminds me of the "turbo" 386 chips of
the early '90s)


WIlliam Holderness

Feb 17, 1999
In article <36cac48a...@nntp.service.ohio-state.edu>,
nospam@nospam!.kom (Steve) wrote:

>http://www.newmedia.com/NewMedia/98/09/labreport/charts.html

Uhhh...Steve, I followed the links: the first had no specs at all and was
basically a "fluff" piece (as was the last link); the second link did show
some performance graphs of real applications.

However... of the systems benchmarked the only clear single/dual processor
comparison (all other things being equal) was the HP Kayak's. And that
showed little to no performance enhancement from dual processors except on
one specialized benchmark (MPEG2 compression)

Isn't that what people like Joe R. have been saying all along, that dual
processors are of little performance advantage EXCEPT where the process is
easily divisible and therefore easily distributed? (which MPEG2
compression is)

I would like to add that processes like MPEG2 compression are much better
off on specialized hardware cards/systems that can encode/decode/preview
MPEG2 in real time, about a 10X difference over software CODEC's (as a
side note; the Heuris encoder's performance on a single G3 is smoking
fast)

Interestingly enough the benchmarks (dell 410 Vs 610) also show little
performance advantages of the Xeon over the PentiumII, which is something
else I have heard Joe R. (and others in CSMA) say time and time again.

The remaining two links were just the same tired old Spec marks, which
many (far better educated and qualified than I) argue are synthetic,
contrived and not representative of real world performance. Isn't it
interesting that the difference between the real world program benchmarks
of the second link and the specmarks of the third and fourth links seems to
confirm the error inherent in SpecMarks?

Bill Holderness
IMPACT

Earl Malmrose

Feb 17, 1999
Gary wrote in message <36CB034B...@aol.com>...

Most Celerons have a small, full processor speed L2 cache.

Frank DeCarlo

Feb 17, 1999
The results at http://www.toolbox.net/300vs300/index.html are in line
with the benchmarking I've done so far. A dual P-II NT system simply
blows away a single G3.

Frank

WIlliam Holderness <bhold...@erols.remove.this.com> wrote in message
news:bholderness-17...@207-172-95-241.s50.as2.dwt.erols.com...

Frank DeCarlo

Feb 17, 1999
Only the latest Celerons have a 128K L2 cache. All earlier models
had none.

Frank

Earl Malmrose <earl.m...@nospam.sierra.com> wrote in message news:0CDy2.100$yo2...@news14.ispnews.com...

Matt Harris

Feb 17, 1999

from http://newmedia.com/newmedia/99/02/feature/server.html
In a new line of Apple Power Macs, the G4 will include up to 2MB of cache,
faster L2 and system bus bandwidth, multiprocessor readiness (which,
unfortunately, Mac OS X is not currently scheduled to support), and a
Katmai-like software-based instruction set called AltiVec.


Does this mean we won't be seeing multi-processor Macs any time soon?

Rick Tan

Feb 17, 1999
In article <36cac48a...@nntp.service.ohio-state.edu>,
nospam@nospam!.kom (Steve) wrote:

> And the Mac advocates continue to promote the idea that Photoshop is
> the be all and end all of all applications. If something works in
> Photoshop, they believe it works for all applications, and if it
> doesn't work for Photoshop, they believe it doesn't work on all other
> applications. One instance of an application where SMP works does not
> mean that it works well in all applications any more than one
> application where it does not work means that it does not work in all
> applications. The advantage for a dual processor system running
> Windows NT4 ranges from a slight disadvantage to nearly 100% over a
> single processor system. The advantage is also dependent on the
> amount of memory in the system, with more memory being a bigger
> advantage with dual-processor systems.

I don't think this is the meat of the argument/discussion.

Mac advocates promote the idea that Photoshop is an important application
because it allows them to do what they need to do.

Windows advocates promote the idea that dual processor/SMP/PMT/Protected
Memory/WindowsNT/98/95/etc is what really counts to get them to do what
they need to do.

Earl Malmrose

Feb 17, 1999
Frank DeCarlo wrote in message <7af3m8$q...@bgtnsc03.worldnet.att.net>...

>Only the latest Celerons have a 128K L2 cache. All earlier models
>had none.

Only the very first Celerons had none, which sold poorly, because they
performed poorly. That was quickly remedied with the A models, and now all
the socket 370 models have the cache.

Bill Frisbee

Feb 17, 1999

Frank DeCarlo <f...@att.net> wrote in message
news:7af3m8$q...@bgtnsc03.worldnet.att.net...

>Only the latest Celerons have a 128K L2 cache. All earlier models
>had none.
>
>Frank

No, only the VERY first Celerons (266 and 300MHz) were released with no
cache. They were trounced by the press, and a month and a half later
Intel released the 300A and 333, both with 128K cache running at CHIP
speed. Now they are just ending 300A and 333 production, moving fully
to 366, 400 and 433MHz. Soon we will be seeing 466 Celerons (rumour has
it they will be on a .18um die and MAYBE have more L1 cache (64K)).

BTW Intel announced today that along with the Pentium III 450s and 500s they
will also be releasing the 550.


Bill F.


Steve

Feb 17, 1999
On Wed, 17 Feb 1999 13:08:36 -0500, bhold...@erols.remove.this.com
(WIlliam Holderness) wrote:

>In article <36cac48a...@nntp.service.ohio-state.edu>,
>nospam@nospam!.kom (Steve) wrote:
>

>>http://www.newmedia.com/NewMedia/98/09/labreport/charts.html
>
>Uhhh...Steve, I followed the links: the first had no specs at all and was
>basically a "fluff" piece (as was the last link); the second link did show
>some performance graphs of real applications.

You didn't read the first article completely then, if you don't think
it had any specs. I'll enumerate them here again since you apparently
can't find them at Anand's web site
(http://www.anandtech.com/html/review_display.cfm?document=483)
Comparing a single-processor PII 300 vs. a dual-processor PII 300
(some of the numbers are extracted from the text and some are based on
interpolating the data in the graphics):

 64 MB RAM: no difference between systems
128 MB RAM: 34% advantage for Microstation SE
            20% advantage for Photoshop 4.0
            60% advantage for Visual C++ 5.0
256 MB RAM: 25% advantage for Microstation SE
            40% advantage for Photoshop 4.0
            60% advantage for Visual C++ 5.0
384 MB RAM: basically the same as 256 MB
512 MB RAM: basically the same as 256 MB

>
>However... of the systems benchmarked the only clear single/dual processor
>comparison (all other things being equal) was the HP Kayak's.

Yes, that's correct.

> And that
>showed little to no performance enhancement from dual processors except on
>one specialized benchmark (MPEG2 compression)

I guess it depends on what you think is significant.

I'll summarize the results here again. The comparison system is an HP
Kayak XW 400 Mhz PII with single or dual processors. I am using the
single processor system as the base in calculating the percent
difference which accounts for a slight difference in the numbers I
have reported previously.

Viewperf:         9% advantage
Softimage
  (unshaded):    22% advantage
  (shaded):      23% advantage
Lightscape:      24% advantage
Indy 3D
  (animation):    0% advantage
  (simulation):   4% disadvantage
3D Studio Max
  (rotate):      31% advantage
Photoshop:       19% advantage
Debabilizer:     31% disadvantage (here's one that does horribly)
MPEG1 enc:       21% advantage
MPEG2 enc:       34% advantage
Studio Pro:       1% advantage
Premiere:         0% advantage

>Isn't that what people like Joe R. have been saying all along, that dual
>processors are of little performance advantage EXCEPT where the process is
>easily divisible and therefore easily distributed? (which MPEG2
>compression is)

Or where you multi-task - an issue you continue to avoid.

>I would like to add that processes like MPEG2 compression are much better
>off on specialized hardware cards/systems that can encode/decode/preview
>MPEG2 in real time, about a 10X difference over software CODEC's (as a
>side note; the Heuris encoder's performance on a single G3 is smoking
>fast)

Many things are better on specialized hardware; it just depends on
whether the performance advantage is worth the additional cost.

>Interestingly enough the benchmarks (dell 410 Vs 610) also show little
>performance advantages of the Xeon over the PentiumII, which is something
>else I have heard Joe R. (and others in CSMA) say time and time again.

SPEC claims about a 10% advantage for a Xeon which seems about right
to me.

>The remaining two links were just the same tired old Spec marks, which
>many (far better educated and qualified than I) argue are synthetic,
>contrived and not representative of real world performance. Isn't it
>interesting that the difference between the real world program benchmarks
>of the second link and the specmarks of the third and fourth links seems to
>confirm the error inherent in SpecMarks?

No, they don't do so at all. SPECrate shows the ability of
multi-threaded SMP-aware apps to scale across multiple processors.
SPECrate is not designed to show the average performance of
applications in general on multiple processor systems.

What error in SPEC are you talking about? We've been over this issue
numerous times and I've shown that SPECint95 does a reasonably good
job of predicting the average relative difference in application
performance between PIIs and G3s. The claims by some people that SPEC
does not accurately represent the performance difference between PIIs
and Xeons, or G3s and 604es, are false, and I've also addressed this
issue on other threads.

Again, the Mac advocate *absolutely* refused to address the issue of
multi-tasking. Have Mac advocates become so defensive about CMT and
PMT that they have decided to ignore all posts that refer to
multi-tasking?

mac...@pobox.com

Feb 17, 1999
I guess he was talking about the Celeron 300A (128 KB of L2 cache
running at the full speed of the processor), not the Celeron 300 (no L2
cache), since in a dual configuration the cacheless Celerons would
perform very poorly (even in single mode :-)).

For most applications the Celerons (excepting those old 266 and 300
models) are just as fast as a PII running at the same speed (i.e. a
Celeron 300A running at 450 MHz against a PII 450), since the smaller
L2 cache (128 KB vs. 512 KB) runs at the full speed of the processor
while the PII's cache runs at half the speed (450 MHz vs. 225 MHz). As
a matter of fact the Celerons are even faster than the PII in games
:-)).

Francisco Garcia Maceda
mac...@pobox.com

mac...@pobox.com

Feb 17, 1999
On 17 Feb 1999 19:01:28 GMT, "Frank DeCarlo" <f...@att.net> wrote:

>Only the latest Celerons have a 128K L2 cache. All earlier models
>had none.
>
>Frank

I would rephrase this as: only the first two Celerons were cacheless
(266 and 300) and are out of production. All current models have 128
KB L2 caches running at the full speed of the processor (300A, 333,
366, 400 and soon 433).

Chad Irby

Feb 17, 1999
"Frank DeCarlo" <f...@att.net> wrote:

> The results at http://www.toolbox.net/300vs300/index.html are in line
> with the benchmarking I've done so far. A dual P-II NT system simply
> blows away a single G3.

Then you need to run the PS5 benchmark, and get those numbers up there.
The PS5 is much more comprehensive, and covers a much wider range of
operations.

--

Chad Irby \ My greatest fear: that future generations will,
ci...@magicnet.net \ for some reason, refer to me as an "optimist."

Paul

Feb 17, 1999

Earl Malmrose wrote:

> Frank DeCarlo wrote in message <7af3m8$q...@bgtnsc03.worldnet.att.net>...

> >Only the latest Celerons have a 128K L2 cache. All earlier models
> >had none.
>

> Only the very first Celerons had none, which sold poorly, because they
> performed poorly. That was quickly remedied with the A models, and now all
> the socket 370 models have the cache.

The early Celerons are great BECAUSE they had no cache! No cache means easy
overclocking, which means greater performance for a lot less money.

just my .02


TacitR

Feb 17, 1999
>In a new line of Apple Power Macs, the G4 will include up to 2MB of cache,
>faster L2 and system bus bandwidth, multiprocessor readiness (which,
>unfortunately, Mac OS X is not currently scheduled to support), and a
>Katmai-like software-based instruction set called AltiVec.
>
>
>Does this mean we won't be seeing multi-processor Macs any time soon?

We already have multi-processor Macs. Either you misread the article, or the
person who wrote it misunderstood Apple's OS-X documentation.

Multi-processor Macs have been around for quite a while; System versions later
than (and including) 7.5.2 support multiprocessing. Clone maker Daystar Digital
makes a line of multiprocessor Macs (as well as dual-processor and
quad-processor cards for PCI Power Macs); in fact, I use a four-processor
Daystar Genesis machine with 1GB physical RAM and 4GB fast/wide SCSI RAID array
as a dedicated Photoshop computer at work. Apple's sole multiprocessor
configuration, however, was the Power Macintosh 9600MP, which shipped with two
processors.

Incidentally, I spoke to an Apple system engineer last month and asked him why
Apple isn't offering any multiprocessor G3 systems. His response: "We have them
running in the design lab right now. We had some backside cache issues to
resolve, but we've got them licked." I don't know if that means Apple plans to
bring them to market or not.

Incidentally, on a multiprocessor Mac, Photoshop shows significantly higher
processor utilization than it does on a Pentium/NT system. That's part of the
reason that we went with the Genesis over a multiprocessor Intergraph ExtremeZ
running Windows NT. We put the machines side by side and ran them through a
bunch of identical Photoshop tasks on large (~800MB) files, and the Genesis
completely humiliated the Intergraph. There wasn't even any serious
competition.

------
Onyx, the game of sexual exploration; Xero, the industrial magazine
of art, fiction and photography; and online photo gallery--all at
http://www.xeromag.com/franklin.html


Chad Irby

Feb 17, 1999
nospam@nospam!.kom (Steve) wrote:

> I guess it depends on what you think is significant.
>
> I'll summarize the results here again. The comparison system is an HP
> Kayak XW 400 Mhz PII with single or dual processors. I am using the
> single processor system as the base in calculating the percent
> difference which accounts for a slight difference in the numbers I
> have reported previously.

(list of apps and speedups deleted)

...of course, you neglect to mention that the speedups you see above were
for a few apps that were specifically enabled to use SMP.

You keep harping about the possible advantage of using a non-SMP app on a
dual-proc system and how you can run two things without having either of
them slow down the other, but you also don't mention that you don't see an
actual *speed increase* from about 95% of the apps out there. You just
don't get that much of a speed *loss*.

That's been our point all along, and it's been a major win just to get
*some* NT folks to admit that adding a second processor does *not* cause a
speed gain across the board when running one compute-intensive
application.

Sure, you can run a 3DS Max render in the background while playing a tough
game in the foreground, but you *are* going to get some slowdowns, because
a lot of the speed of that 3-D app depends on using the dedicated hardware
on the 3-D card you need to run Unreal... and if you don't buy a *lot* of
extra system RAM, *both* processes are going to be RAM-starved.

Earl Malmrose

Feb 17, 1999
Paul wrote in message <36CAEF76...@webzone.net>...

No. The Celerons w/cache are just as overclockable.

Jerry Kindall

Feb 17, 1999
In article <19990217182513...@ng99.aol.com>, tac...@aol.com
(TacitR) wrote:

> Incidentally, I spoke to an Apple system engineer last month and asked
> him why Apple isn't offering any multiprocessor G3 systems. His
> response: "We have them running in the design lab right now. We had
> some backside cache issues to resolve, but we've got them licked." I
> don't know if that means Apple plans to bring them to market or not.

No point in that with the G4 so close. In fact, the G3's cache
architecture isn't well suited to multiprocessing, so I wonder why they'd
even bother trying to solve the problems when a better chip with none of
the problems is on its way.

--
Jerry Kindall mailto:kin...@mail.manual.com Technical Writing
Manual Labor http://www.manual.com We wrote the book!

"During your times of trial and suffering, when you see only one
set of footprints, it was then that I was riding the pogo stick."

Thingfishhhh

Feb 17, 1999
In article <36da9b8c...@10.1.1.9>, dav...@ozramp.net.au.despam-me!
(davexl) wrote:

> Photoshop is NOT a good example of Symmetric Multi-Processing. In fact
> hardly any of it IS SMP. See table below.

Nor was it very good running Mac version 4.0 with a Newer dual 133MHz 604
upgrade card - we found that only SOME of the plug-ins were MP supported,
and the program itself made little use of both processors, if any. We
found that a Newer 200MHz single 604e ran much faster than the dual
set-up.

Sadly, I think Photoshop will remain this way until the majority of users
have multiple processors - Adobe's stubborn that way. Lowest common
denominator is the rule of the day.

Oh well. Personally, I find I can get quite a speed boost by setting up PS
correctly. Give it a ton of RAM (ever used PS with over 500 megs of RAM? I
have, and damn it, I *miss* it. Too bad that company was hell to work for)
so PS doesn't have to hit the scratch disk very often, have FAST drives,
dedicate at *LEAST* a gig to pure scratch disk (this is CRUCIAL), and have
a VERY fast video card with a lot of VRAM, and PS really *performs*.

Thingfishhhh

Feb 17, 1999
In article <36cac48a...@nntp.service.ohio-state.edu>,
nospam@nospam!.kom (Steve) wrote:

> And the Mac advocates continue to promote the idea that Photoshop is
> the be all and end all of all applications.

No, it's simply a rock-solid, dependable application with an *enormous*
user base. If Corel or any other program had a similar reputation, they
would be used in the tests.

And, speaking as a 12 year graphics/pre-press professional, Photoshop IS
the end-all be-all in image editing/color correction. NOTHING compares.
Corel is a JOKE when it comes to color correction, stable and accurate
RGB-CMYK conversion, all around toolsets, and stability - and Corel is the
only thing that comes close (Except in Unix - TIFFany is pretty darned
good, but still not as good as PS).

Photoshop is used, even by PC advocates, by the way, because of PS's
ability to bring just about any system to it's knees - one test I always
run on an unfamiliar system is Unsharp Mask, to get a general sense of
speed. Unsharp Mask, along with Gaussian Blur, are CPU intensive tasks
that visually and physically demonstrate the true *day to day* speed of a
system, not a series of benchmarks that mean little to the average user.
Gee - why do you think a lot of benchmarks USE PS?


> If something works in
> Photoshop, they believe it works for all applications, and if it
> doesn't work for Photoshop, they believe it doesn't work on all other
> applications.

Don't speak for all Mac users - you're a poor representative, and you're
assigning a lot of behavior that is mirrored *exactly* in the Wintel Herd
crowd. I've seen countless graphs with the word Photoshop attached to
them describing Wintel hardware and software - too many for you to get
away with an asinine accusation like this.

Chris Cox

Feb 17, 1999
In article <36da9b8c...@10.1.1.9>, dav...@ozramp.net.au.despam-me!
(davexl) wrote:

> I am sick to death of this misinformation.
>

> Photoshop is NOT a good example of Symmetric Multi-Processing. In fact
> hardly any of it IS SMP. See table below.
>

> I ran the infamous Photoshop 5 action file on a dual Celeron 450 in

> one and two CPU configuration. To summarise: SMP is implemented so
> badly that the difference on this test was 7%. Yes, seven whole
> percent. Only five functions in this test were multithreaded. This is
> not a biased example of choosing functions either - notice that the
> few that are SMP are counted multiple times.

Please check your facts before showing your ignorance in a public forum.

The biggest problem is that many of the operations you tested are bandwidth
limited. If you add more processors, they're still limited by how fast the
system can move data to and from DRAM. If you have multiple processors
trying to do something that is limited by DRAM bandwidth, it may actually
go SLOWER than having a single processor working on that task (unless
you've got one heck of a big crossbar for a system bus like certain
workstations and supercomputers).
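
To put rough illustrative numbers on that (mine, for the sake of
argument, not measurements): a simple point operation reads and writes
every byte of a 30 MB layer, so one pass moves about 60 MB. At roughly
100 MB/s of sustained DRAM bandwidth that is 0.6 s no matter how many
CPUs share the bus, and a 450 MHz CPU needs only a few cycles of
arithmetic per byte for such an operation, so it spends most of that
time waiting on memory. A second CPU on the same bus just splits the
same 100 MB/s two ways.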

If Digital Fusion is showing large gains from multiple processors, then
either they are severely CPU limited or they need to do a lot more
optimization on their code (most likely both).

Chris

Dave Cardinal

Feb 17, 1999
As far as the Xeon vs Pentium II, it seems like the major advantage it has
is the ability to go to more than 2 processors, so that vendors can build
quad servers again (rather than keep selling Pentium Pro quad servers
indefinitely).

--Dave Cardinal
http://www.ourhouse.com/photo


Steve <nospam@nospam!.kom> wrote in message
news:36cb2725...@nntp.service.ohio-state.edu...

Steve

Feb 18, 1999
On 17 Feb 1999 23:31:54 GMT, ci...@magicnet.net (Chad Irby) wrote:

>nospam@nospam!.kom (Steve) wrote:
>
>> I guess it depends on what you think is significant.
>>
>> I'll summarize the results here again. The comparison system is an HP
>> Kayak XW 400 Mhz PII with single or dual processors. I am using the
>> single processor system as the base in calculating the percent
>> difference which accounts for a slight difference in the numbers I
>> have reported previously.
>

>(list of apps and speedups deleted)
>
>...of course, you neglect to mention that the speedups you see above were
>for a few apps that were specifically enabled to use SMP.
>
>You keep harping about the possible advantage of using a non-SMP app on a
>dual-proc system and how you can run two things without having either of
>them slow down the other, but you also don't mention that you don't see an
>actual *speed increase* from about 95% of the apps out there. You just
>don't get that much of a speed *loss*.
>
>That's been our point all along, and it's been a major win just to get
>*some* NT folks to admit that adding a second processor does *not* cause a
>speed gain across the board when running one compute-intensive
>application.
>
>Sure, you can run a 3DS Max render in the background while playing a tough
>game in the foreground, but you *are* going to get some slowdowns, because
>a lot of the speed of that 3-D app depends on using the dedicated hardware
>on the 3-D card you need to run Unreal... and if you don't buy a *lot* of
>extra system RAM, *both* processes are going to be RAM-starved.

Grandma is not going to see an advantage from an SMP system, but she's
not going to see an advantage from an increase in single-processor
speed either. I commonly saturate a 1-processor system (100% CPU
utilization) running GIS applications or a combination of a dozen or
so other applications. With a SMP system I can run multiple instances
of the same application and/or multiple applications and get a near
doubling of productivity. The reason is that I get twice as much work
done in the same time (i.e. run 2 raster to vector conversions in the
same amount of time it would take to get 1 conversion done on a single
processor system). Although you could get the same effect by
purchasing another computer, by purchasing a SMP system you avoid the
cost of duplicating everything but the processor, save office space,
and are able to take advantage of SMP-enabled applications.

I am not trying to argue that most people would see enough of a speed
boost from a SMP system to make it cost effective. I am saying that
many people can see very large productivity increases and good cost
effectiveness if they run SMP-aware applications and/or multi-task CPU
intensive applications. I hope that this discussion will make people
look before they leap as well as show that SMP can be very useful. An
SMP system does not automatically imply a large speedup. However,
those people who need serious power and who typically multi-task
CPU-intensive applications will see cost-effective productivity gains
from an SMP system.
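
As a toy illustration of that point (a POSIX sketch of mine; on NT you
would launch the jobs with CreateProcess instead): two independent
CPU-bound processes need no SMP-aware code at all, yet a two-CPU box
finishes the pair in roughly the time one job takes.

/* Toy sketch: run two independent CPU-bound jobs concurrently.
 * On a 2-CPU machine the scheduler puts one on each processor. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* stand-in for one raster-to-vector conversion: pure CPU work */
static void cpu_bound_job(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; i++)
        x += (double)i * 1e-9;
}

int main(void)
{
    for (int job = 0; job < 2; job++) {
        pid_t pid = fork();
        if (pid == 0) {   /* child: do one job and exit */
            cpu_bound_job();
            _exit(0);
        }
    }
    while (wait(NULL) > 0)   /* parent: wait for both jobs */
        ;
    puts("both jobs done");
    return 0;
}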

Chad Irby

Feb 18, 1999
nospam@nospam!.kom (Steve) wrote:

> I am not trying to argue that most people would see enough of a speed
> boost from a SMP system to make it cost effective.

There's the difference. We're continually getting NT dudes in here who
keep telling us how their machines are so much faster in *everything*, not
just in some cases.

> I am saying that
> many people can see very large productivity increases and good cost
> effectiveness if they run SMP-aware applications and/or multi-task CPU
> intensive applications.

More like "some" people can, and the caveats are quite large.

davexl

Feb 18, 1999
Christian Bau <christ...@isltd.insignia.com> wrote:

>davexl wrote:
>>
>> I am sick to death of this misinformation.
>>
>> Photoshop is NOT a good example of Symmetric Multi-Processing. In fact
>> hardly any of it IS SMP. See table below.
>

>In the context of this newsgroup, there is the continuing debate between
>Macintosh users saying "CMT works in practice" and trolls saying "only
>PMT works". The debate is between "works in practice" and "magic
>bullet".
>
>So someone posted that in the case of Photoshop and SMP, the "magic
>bullet" approach doesnt work. The argument up to this point is like
>this:
>
>Troll: SMP works by magic.
>Mac user: No it doesn't.
>Troll: Yes it does.
>Someone posted: I tried Photoshop on Windows NT on a dual processor
>machine, and the SMP magic bullet didn't work.
>davexl: I am sick to death of this misinformation. The fault is with
>Photoshop, not with the magic bullet.
>
>So the magic bullet SMP only works if the application writer
>cooperates??? Just like multiprocessing on the Macintosh works if the
>application writer cooperates???

Yes and no.

Yes, on all platforms you need to write for it to get the full benefit.

(BUT on NT you can get around that by running 2 copies/another app.)

No, MP on the Mac has never reached the standard of a fully threaded
NT-native application. Too many inefficiencies and gotchas, like RAM
partitions per CPU, etc.

davexl

Feb 18, 1999
"Frank DeCarlo" <f...@att.net> wrote:

>The results at http://www.toolbox.net/300vs300/index.html are in line
>with the benchmarking I've done so far. A dual P-II NT system simply
>blows away a single G3.

DEPENDING on the functions!

Photoshop tests show anywhere from -30% to +80% advantage to a dual
P-II. If I was talking Gaussian blurs then yes, our dual P-II systems
blow away G3s. If we are talking Color Halftone filters, then the
position is reversed.

Both sides can cite Photoshop results and both be correct in that
instance.

The amount of power available on a dual P-II 'blows away' a single G3,
but only if it can be used. Which, thankfully, it normally is :-)

davexl

Feb 18, 1999
ci...@magicnet.net (Chad Irby) wrote:

>nospam@nospam!.kom (Steve) wrote:
>
>> I guess it depends on what you think is significant.
>>
>> I'll summarize the results here again. The comparison system is an HP
>> Kayak XW 400 Mhz PII with single or dual processors. I am using the
>> single processor system as the base in calculating the percent
>> difference which accounts for a slight difference in the numbers I
>> have reported previously.
>

>(list of apps and speedups deleted)
>
>...of course, you neglect to mention that the speedups you see above were
>for a few apps that were specifically enabled to use SMP.
>
>You keep harping about the possible advantage of using a non-SMP app on a
>dual-proc system and how you can run two things without having either of
>them slow down the other, but you also don't mention that you don't see an
>actual *speed increase* from about 95% of the apps out there. You just
>don't get that much of a speed *loss*.

True. No pain, much to gain. Run two.

>That's been our point all along, and it's been a major win just to get
>*some* NT folks to admit that adding a second processor does *not* cause a
>speed gain across the board when running one compute-intensive
>application.

Well, if it helps, this NT guy admits it.

I still think it is splitting hairs because I am always running more
than one thing. It would be a major win to have more Mac folks sit in
front of one of these machines and use it, in which case you
immediately notice the difference in speed and responsiveness on
everything.

>Sure, you can run a 3DS Max render in the background while playing a tough
>game in the foreground, but you *are* going to get some slowdowns, because
>a lot of the speed of that 3-D app depends on using the dedicated hardware
>on the 3-D card you need to run Unreal...

I take your point, but in this example you are incorrect: MAX
rendering does not use the 3D card at all.

Yes you occasionally get slowdowns, but it is a sort of annoyance I
will gladly take over the alternative.

It has been my experience that two single-threaded tasks hardly bother
each other - NT just pops one on each CPU and off they go. RAM is the
only bottleneck; once you compare the cost of upping your RAM versus
an extra machine, I put this problem in the 'nice to have' basket.

Mind you, this assumes you aren't trying to play two different games
at once on two monitors!

>and if you don't buy a *lot* of
>extra system RAM, *both* processes are going to be RAM-starved.

Hardly.

Yes, you need some more RAM, but less than if you bought two machines.
NT can actually divvy it up efficiently and dynamically, remember.

You say a *lot* like it was some sort of dirty thing to need more RAM.
As if doubling your RAM was worse than buying a whole extra machine.
Not that you NEED to double it, but you get my point.

Not every app is Photoshop (RAM hungry). The two other main apps I run
each day are very CPU hungry, but not as RAM hungry (usually!). They
can run all the time as background tasks.

Put another way, if I run this Photoshop test while also running a
large 3D Studio MAX render, and a Digital Fusion comp, the scores stay
exactly the same. Without SMP I would have to buy another single P-II
machine ("a G3") before I could complete the same tasks in the same
time as I do now.

Obviously if they were all I/O bound or fighting for the same resource
this would not work, but I have found in animation and graphics a
large number of apps where SMP gives a huge boost to our productivity.


Despite Photoshop being lame in SMP, I can still do about twice the
work I could otherwise do on a Mac in the same time.


mac...@pobox.com

Feb 18, 1999
Haven't you heard about the Celeron 300A? It is the king of
overclocking, and it has L2 cache: 90%+ of them make it to 450 MHz
(even if some need a little extra voltage).

David T. Wang

Feb 18, 1999
Chris Cox (cc...@slip.net) wrote:
: In article <36da9b8c...@10.1.1.9>, dav...@ozramp.net.au.despam-me!
: (davexl) wrote:

: > I am sick to death of this misinformation.
: >
: > Photoshop is NOT a good example of Symmetric Multi-Processing. In fact
: > hardly any of it IS SMP. See table below.

: >
: > I ran the infamous Photoshop 5 action file on a dual Celeron 450 in


: > one and two CPU configuration. To summarise: SMP is implemented so
: > badly that the difference on this test was 7%. Yes, seven whole
: > percent. Only five functions in this test were multithreaded. This is
: > not a biased example of choosing functions either - notice that the
: > few that are SMP are counted multiple times.

: Please check your facts before showing your ignorance in a public forum.

: The biggest problem is that many of the operations you tested are bandwidth
: limited. If you add more processors, they're still limited by how fast the
: system can move data to and from DRAM. If you have multiple processors
: trying to do something that is limited by DRAM bandwidth, it may actually
: go SLOWER than having a single processor working on that task (unless
: you've got one heck of a big crossbar for a system bus like certain
: workstations and supercomputers).

Please point out the operations which are bandwidth limited, then we
can check against some of the numbers. There are Pentium II's
with 66 MHz and 100 MHz busses, and Macs with 50/66/100 MHz
memory busses. With publicly available data, we can see the amount
of improvement that increased memory bandwidth has brought.
(e.g. there should be little difference between the 350/400/450 MHz
Pentium II's, but a big one between the 333 and 350 MHz Pentium II.
Similar deltas should exist for the Macintosh platform)

With this data, we can better decide if your assertion of memory-bandwidth
saturation or Mr Davexl's assertion of poor threading is the bottleneck
for those specific functions in dispute.

(I haven't checked the data, merely asking which ones we ought to be
looking at, which may prove/disprove your assertion.)

: If Digital Fusion is showing large gains from multiple processors, then


: either they are severely CPU limited or they need to do a lot more
: optimization on their code (most likely both).

: Chris

--
No SPAM or email wholly without substance please
dave...@cslab.kecl.ntt.co.jp.I.like.green.eggs.and.ham,not.spam
All statements are personal opinions
Not speaking for NTT or University of Maryland
Kyoto, Japan.

davexl

Feb 18, 1999
cc...@slip.net (Chris Cox) wrote:

>In article <36da9b8c...@10.1.1.9>, dav...@ozramp.net.au.despam-me!
>(davexl) wrote:
>
>> I am sick to death of this misinformation.
>>
>> Photoshop is NOT a good example of Symmetric Multi-Processing. In fact
>> hardly any of it IS SMP. See table below.
>>
>> I ran the infamous Photoshop 5 action file on a dual Celeron 450 in
>> one and two CPU configuration. To summarise: SMP is implemented so
>> badly that the difference on this test was 7%. Yes, seven whole
>> percent. Only five functions in this test were multithreaded. This is
>> not a biased example of choosing functions either - notice that the
>> few that are SMP are counted multiple times.
>
>Please check your facts before showing your ignorance in a public forum.
>
>The biggest problem is that many of the operations you tested are bandwidth
>limited. If you add more processors, they're still limited by how fast the
>system can move data to and from DRAM. If you have multiple processors
>trying to do something that is limited by DRAM bandwidth, it may actually
>go SLOWER than having a single processor working on that task (unless
>you've got one heck of a big crossbar for a system bus like certain
>workstations and supercomputers).
>

>If Digital Fusion is showing large gains from multiple processors, then
>either they are severely CPU limited or they need to do a lot more
>optimization on their code (most likely both).
>
>Chris

Musta struck a nerve.

With respect Chris, you haven't the slightest clue. Not even a little
one. Not only are you wrong, but you are out by orders of magnitude.

NONE of those Photoshop operations are bandwidth limited.

10 meg of image, which can't be shunted from RAM to processor in the
18 seconds to perform the filter? In one second even? That is a slow
bus.

Assume a pessimistic bandwidth of 100MB a sec, and you begin to see
that bus speed and RAM speed are the least of our problems.

If Photoshop were bandwidth limited, then the CPU would be starved and
every operation on the same image would take about the same minimum time.

How do you then explain that a 90 degree rotate of this image takes
0.3 of a second, yet a Water Color filter takes 27.3 seconds - and that
it was BANDWIDTH that stopped the second CPU having a crack at it?
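
Spell it out with the same pessimistic figures: one pass over the 10
meg image reads and writes at most 20 MB, which at 100MB a sec is 0.2
seconds of memory traffic - against 27.3 seconds of measured Water
Color time. The other 27-odd seconds is the CPU doing arithmetic.
That is what CPU limited looks like.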

Most 2D and 3D graphics operations are "severely" CPU limited. Very
few applications shunt hundreds of MB a second around. If they do,
they didn't get it off disk, that's for sure.

Raster manipulation of bitmaps in 2D graphics and video effects is
both CPU bound and a very good candidate for SMP. I am not saying
Adobe can't write good code. I am saying they haven't bothered.

If you had ever seen Digital Fusion in action you would never accuse
them of writing poor code. The other reason Fusion is so high
performance is that while it typically works on smaller individual
frames than Photoshop might, it has to process thousands of them in
sequence, where piss poor performance would be noticed - as it is in
another Adobe product, After Effects.

If a Photoshop filter takes 5 seconds instead of ten, who cares? If a
render takes 5 hours instead of ten...

If Digital Fusion is using 100% of two CPUs because of inefficient
code, tell me how I get a 199% speedup on a dual system?

Be creative, we need another laugh.

>Please check your facts before showing your ignorance in a public forum.

Sound, sound advice...

davexl

Feb 18, 1999
ci...@magicnet.net (Chad Irby) wrote:

>nospam@nospam!.kom (Steve) wrote:
>
>> I am not trying to argue that most people would see enough of a speed
>> boost from a SMP system to make it cost effective.
>
>There's the difference. We're continually getting NT dudes in here who
>keep telling us how their machines are so much faster in *everything*, not
>just in some cases.
>
>> I am saying that
>> many people can see very large productivity increases and good cost
>> effectiveness if they run SMP-aware applications and/or multi-task CPU
>> intensive applications.
>
>More like "some" people can, and the caveats are quite large.

Ever looked into a MP Mac like the Daystars?

Definition of caveat. Which is why they never took off.


Phil Brewster

Feb 18, 1999
On Wed, Feb 17, 1999 9:39 PM, davexl
<mailto:dav...@ozramp.net.au.despam-me!> wrote:
>"Frank DeCarlo" <f...@att.net> wrote:
>
>>The results at http://www.toolbox.net/300vs300/index.html are in line
>>with the benchmarking I've done so far. A dual P-II NT system simply
>>blows away a single G3.
>
>DEPENDING on the functions!
>
>Photoshop tests show anywhere from -30% to +80% advantage to a dual
>P-II. If I was talking Gaussian blurs then yes, our dual P-II systems
>blow away G3s. If we are talking Color Halftone filters, then the
>position is reversed.
>

Is the key difference here really multithreading and SMP support as
between sets of Photoshop filters, though, rather than between Photoshop
and apps in general that 'do SMP right'?

It's always seemed to me that MMX-accelerated vs. unaccelerated filters
showed the most dramatic discrepancies in these tests, _within_ Photoshop.

But maybe the underlying code for the unaccelerated filters in Windows is
simply more archaic in general, so that it does indeed offer less SMP
support than the MMX-accelerated filters do.....

(As I've mentioned before in previous threads, Adobe also has to support
SMP-less Win9x users with this generic Win PS code base, lest we forget....
<g>)

>Both sides can cite Photoshop results and both be correct in that
>instance.
>
>The amount of power available on a dual P-II 'blows away' a single G3,
>but only if it can be used. Which, thankfully, it normally is :-)
>
>
>

Just wondering: do direct Photoshop competitors on NT such as Corel
PhotoPaint 8 or Paintshop Pro offer any better SMP support than Adobe does?

If so, then maybe it will motivate Adobe to get with the SMP program on NT.
If not, though, then why should they, as the image-editing industry's
proverbial 800-lb gorilla, really bother?....

Cheers,

--

Phil Brewster <pjbrew at ix dot netcom dot com>

"It requires a very unusual mind to undertake the analysis of the obvious."

-- Alfred North Whitehead

Josh Lewis

Feb 18, 1999
In article <7af7sq$f...@sjx-ixn5.ix.netcom.com>, "Matt Harris"
<do...@ix.netcom.com> wrote:

> from http://newmedia.com/newmedia/99/02/feature/server.html


> In a new line of Apple Power Macs, the G4 will include up to 2MB of cache,
> faster L2 and system bus bandwidth, multiprocessor readiness (which,
> unfortunately, Mac OS X is not currently scheduled to support), and a
> Katmai-like software-based instruction set called AltiVec.
>
>
> Does this mean we won't be seeing multi-processor Macs any time soon?
>

8.6, not OS X, is supposed to be getting a big MP push soon. I do not know if
that means Apple will produce MP hardware soon or not.


>
> >I am dying for OS-X to be
> >released so Macs and Mac first software companies like Adobe will
> >'get' the concept of SMP.
> >

Josh Lewis

Feb 18, 1999
In article <36d3991c...@10.1.1.9>, dav...@ozramp.net.au.despam-me!
(davexl) wrote:

> "Frank DeCarlo" <f...@att.net> wrote:
>
> >The results at http://www.toolbox.net/300vs300/index.html are in line
> >with the benchmarking I've done so far. A dual P-II NT system simply
> >blows away a single G3.
>
> DEPENDING on the functions!
>
> Photoshop tests show anywhere from -30% to +80% advantage to a dual
> P-II.

B.S. Many of the Windoze advocates have had trouble getting a 7%
performance increase out of a dual P2 system. Most apps don't see more
than 20%-30% IF THEY ARE MP AWARE, and a flat 0% if they are not.

While I don't doubt the results at toolbox.net, and I don't question the
author's intentions or criticisms of his G3 300, I do think that
xlr8yourmac's benchmarks are more comprehensive and that there is
something severely wrong with NT's implementation.

Steve

Feb 18, 1999
On Wed, 17 Feb 1999 22:48:10 -0800, Thingf...@yahoo.com
(Thingfishhhh) wrote:

>In article <36cac48a...@nntp.service.ohio-state.edu>,
>nospam@nospam!.kom (Steve) wrote:
>

>> And the Mac advocates continue to promote the idea that Photoshop is
>> the be all and end all of all applications.
>
>No, it's simply a rock-solid, dependable application with an *enormous*
>user base. If Corel or any other program had a similar reputation, they
>would be used in the tests.

That's fine, but Photoshop results by themselves cannot be
extrapolated to the general case.

>And, speaking as a 12 year graphics/pre-press professional, Photoshop IS
>the end-all be-all in image editing/color correction. NOTHING compares.
>Corel is a JOKE when it comes to color correction, stable and accurate
>RGB-CMYK conversion, all around toolsets, and stability - and Corel is the
>only thing that comes close (Except in Unix - TIFFany is pretty darned
>good, but still not as good as PS).

See, you are projecting your needs onto everyone else. In my field
(GIS) Arc/Info is the standard and Photoshop is irrelevant. I have no
problem with including Photoshop as one app to test among many. But
Photoshop by itself is meaningless to people who don't use Photoshop.

>Photoshop is used, even by PC advocates, by the way, because of PS's
>ability to bring just about any system to its knees - one test I always
>run on an unfamiliar system is Unsharp Mask, to get a general sense of
>speed. Unsharp Mask, along with Gaussian Blur, are CPU intensive tasks
>that visually and physically demonstrate the true *day to day* speed of a
>system, not a series of benchmarks that mean little to the average user.
>Gee - why do you think a lot of benchmarks USE PS?

PC benchmarks use Photoshop as one of many applications to test. If a
PC magazine was to publish a cross platform test without Photoshop the
Mac advocates would be infuriated. You very seldom see a PC benchmark
test where Photoshop is the *only* application tested, whereas that is
often the case with Mac tests.

>> If something works in
>> Photoshop, they believe it works for all applications, and if it
>> doesn't work for Photoshop, they believe it doesn't work on all other
>> applications.
>
>Don't speak for all Mac users - you're a poor representative, and you're
>assigning a lot of behavior that is mirrored *exactly* in the Wintel Herd
>crowd. I've seen countless graphs with the word Photoshop attached to
>them describing Wintel hardware and software - too many for you to get
>away with an asinine accusation like this.

My point (made with *SARCASM*) is that the results for Photoshop do
not necessarily predict the results for other applications. Photoshop
is an important application on the Mac platform. It is less important
on the PC platform. It is even less important to me. I have a number
of GIS applications that make Photoshop look like small potatoes.

Chad Irby

Feb 18, 1999
dav...@ozramp.net.au.despam-me! (davexl) wrote:

> ci...@magicnet.net (Chad Irby) wrote:
>
> >More like "some" people can, and the caveats are quite large.
>
> Ever looked into a MP Mac like the Daystars?
>
> Definition of caveat. Which is why they never took off.

Exactly. That's part of the reason I never bought one.

Multiprocessor machines, by and large, are still quite useless for general
computing tasks for single users.

Phil Brewster

Feb 18, 1999
On Thu, Feb 18, 1999 3:15 AM, Josh Lewis
<mailto:jsl...@spamsucks.erols.com> wrote:
>In article <7af7sq$f...@sjx-ixn5.ix.netcom.com>, "Matt Harris"
><do...@ix.netcom.com> wrote:
>
>> from http://newmedia.com/newmedia/99/02/feature/server.html
>> In a new line of Apple Power Macs, the G4 will include up to 2MB of
cache,
>> faster L2 and system bus bandwidth, multiprocessor readiness (which,
>> unfortunately, Mac OS X is not currently scheduled to support), and a
>> Katmai-like software-based instruction set called AltiVec.
>>
>>
>> Does this mean we won't be seeing multi-processor Macs any time soon?
>>
>
>8.6, not OS X, is supposed to be getting a big MP push soon. I do not know if
>that means Apple will produce MP hardware soon or not.
>

As always, Apple hasn't announced anything official (yet), but according to
the _rumored_ specs for 'Sawtooth' (Apple G4 Power Macs, the next step up
after Yosemite....), a dual-processor G4 option from Apple as well as MP G4
cards from third parties may indeed be a possibility 8-12 months from now
(so it depends on your definition of 'soon', I guess.... <g>).

See

http://www.macosrumors.com/sawtooth.html

In any event, I think it's safe to say there won't be any MP Apple hardware
_before_ Macs have G4-generation CPUs.....

;-)

(I also think it likely that Apple will wait until OS 8.6 is out the door,
which would mean July 1999 at the very, very, _very_ earliest.... Late 1999
or early 2000 is more likely, IMO, so they don't 'Osborne' too many sales
of Yosemite in the next 8-10 months or so, by marketing its successor model
too soon.... Well, 'too soon' from Apple Marketing's perspective, anyway,
since there haven't been MP Macs available from Apple since the 9600s with
the last 604e CPUs, almost 2 years ago now [the G3 chips do not support MP
by design, due to decisions made by IBM/Moto way back when, IIRC]....)

>
>>
>> >I am dying for OS-X to be
>> >released so Macs and Mac first software companies like Adobe will
>> >'get' the concept of SMP.
>> >
>

Cheers,

Thingfishhhh

Feb 18, 1999
In article <36cc0c6c...@nntp.service.ohio-state.edu>,
nospam@nospam!.kom (Steve) wrote:

> My point (made with *SARCASM*)

You're assuming mine had none of *that*.

Chris Cox

unread,
Feb 18, 1999, 3:00:00 AM2/18/99
to
In article <7agfk6$383$1...@hecate.umd.edu>,
dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
Wang) wrote:

> Chris Cox (cc...@slip.net) wrote:
> : In article <36da9b8c...@10.1.1.9>, dav...@ozramp.net.au.despam-me!
> : (davexl) wrote:
>
> : > I am sick to death of this misinformation.
> : >
> : > Photoshop is NOT a good example of Symmetric Multi-Processing. In fact
> : > hardly any of it IS SMP. See table below.
> : >
> : > I ran the infamous Photoshop 5 action file on a dual Celeron 450 in
> : > one and two CPU configuration. To summarise: SMP is implemented so
> : > badly that the difference on this test was 7%. Yes, seven whole
> : > percent. Only five functions in this test were multithreaded. This is
> : > not a biased example of choosing functions either - notice that the
> : > few that are SMP are counted multiple times.
>
> : Please check your facts before showing your ignorance in a public forum.
>
> : The biggest problem is that many of the operations you tested are bandwidth
> : limited. If you add more processors, they're still limited by how fast the
> : system can move data to and from DRAM. If you have multiple processors
> : trying to do something that is limited by DRAM bandwidth, it may actually
> : go SLOWER than having a single processor working on that task (unless
> : you've got one heck of a big crossbar for a system bus like certain
> : workstations and supercomputers).
>

> Please point out the operations which are bandwidth limited, then we
> can check against some of the numbers.

I can't do that (too much proprietary info) -- but I can tell you that most
of the common operations in Photoshop are bandwidth limited on current
desktop systems.


>There are Pentium II's
> with 66 Mhz and 100 MHz busses, and Macs with 50/66/100 Mhz
> memory busses. With publically available data, we can see the amount
> of improvement that increased memory bandwidth has brought.
> (e.g. there should be little difference between the 350/400/450 MHz
> Pentium II's, but big one between the 333 and 350 MHz Pentium II.
> Similar deltas should exist for the Macintosh platform)

It's not that simple.


>
> With this data, we can better decide if your assertion of memory-bandwidth
> saturation or Mr Davexl's assertion of poor threading is the bottleneck
> for those specific functions in dispute.

Read the credits in Photoshop 4 or 5.
Then decide.


Chris

PS. Photoshop's MP code was originally written for NT, not Macintosh.

Chris Cox

unread,
Feb 18, 1999, 3:00:00 AM2/18/99
to
In article <36e1cbd8....@10.1.1.9>, dav...@ozramp.net.au.despam-me!
(davexl) wrote:

> cc...@slip.net (Chris Cox) wrote:
>
> >In article <36da9b8c...@10.1.1.9>, dav...@ozramp.net.au.despam-me!
> >(davexl) wrote:
> >
> >> I am sick to death of this misinformation.
> >>
> >> Photoshop is NOT a good example of Symmetric Multi-Processing. In fact
> >> hardly any of it IS SMP. See table below.
> >>
> >> I ran the infamous Photoshop 5 action file on a dual Celeron 450 in
> >> one and two CPU configuration. To summarise: SMP is implemented so
> >> badly that the difference on this test was 7%. Yes, seven whole
> >> percent. Only five functions in this test were multithreaded. This is
> >> not a biased example of choosing functions either - notice that the
> >> few that are SMP are counted multiple times.
> >
> >Please check your facts before showing your ignorance in a public forum.
> >
> >The biggest problem is that many of the operations you tested are bandwidth
> >limited. If you add more processors, they're still limited by how fast the
> >system can move data to and from DRAM. If you have multiple processors
> >trying to do something that is limited by DRAM bandwidth, it may actually
> >go SLOWER than having a single processor working on that task (unless
> >you've got one heck of a big crossbar for a system bus like certain
> >workstations and supercomputers).
> >

> >If Digital Fusion is showing large gains from multiple processors, then
> >either they are severely CPU limited or they need to do a lot more
> >optimization on their code (most likely both).
> >
> >Chris
>
> Musta struck a nerve.

I just hate misinformation and the spread of ignorance.

>
> With respect Chris, you haven't the slightest clue. Not even a little
> one. Not only are you wrong, but you are out by orders of magnitude.

With respect: go read the credits on Photoshop 4 or 5.


>
> NONE of those Photoshop operations are bandwidth limited.

I have bus traces and benchmarks that prove otherwise.
As do all the major processor and system manufacturers who use Photoshop as
a benchmark.

>
> 10 meg of image, which can't be shunted from RAM to processor in the
> 18 seconds to perform the filter? In one second even? That is a slow
> bus.

Unless the filter operation is iterative.....

>
> Assuming a pessimistic bandwidth of 100MB a sec, and you begin to see
> that bus speed and RAM speed are the least of our problems.

That's a bus bandwidth for what?
a read?
a write?
a read-modify-write?
how many streams?
how many DRAM banks are available?
how many DRAM pages are open?
how many ways are available in the cache for those streams?

Yes, for you DRAM speed is not a problem -- but education appears to be.
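
To make that concrete, here is a minimal C sketch (mine, not from the
thread; the buffer size and the crude clock() timing are arbitrary
choices) that times a pure read, a pure write, and a read-modify-write
over the same buffer. The three usually report quite different MB/s
figures, which is why a single "100MB a sec" number is ambiguous:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (16 * 1024 * 1024)   /* 16 MB: large enough to defeat the caches */

int main(void)
{
    unsigned char *buf = malloc(N);
    volatile unsigned long sink = 0;
    clock_t t0, t1;
    long i;

    memset(buf, 1, N);         /* touch every page before timing */

    t0 = clock();              /* pure read: one stream in */
    for (i = 0; i < N; i++)
        sink += buf[i];
    t1 = clock();
    printf("read:  %.0f MB/s\n", N / 1e6 / ((double)(t1 - t0) / CLOCKS_PER_SEC));

    t0 = clock();              /* pure write: one stream out */
    for (i = 0; i < N; i++)
        buf[i] = (unsigned char)i;
    t1 = clock();
    printf("write: %.0f MB/s\n", N / 1e6 / ((double)(t1 - t0) / CLOCKS_PER_SEC));

    t0 = clock();              /* read-modify-write: each byte crosses the bus twice */
    for (i = 0; i < N; i++)
        buf[i]++;
    t1 = clock();
    printf("rmw:   %.0f MB/s (counting each byte once)\n",
           N / 1e6 / ((double)(t1 - t0) / CLOCKS_PER_SEC));

    free(buf);
    return 0;
}

A real measurement would also vary the number of concurrent streams,
which is the point of the questions above.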


> If Photoshop were bandwidth limited, then the CPU would be starved and
> all operations would take a minimum time.

Please take a few classes and do some experimentation before you start
lecturing on things you have no idea about.

>
> How do you then explain that a 90 degree rotate of this image takes
> 0.3 of a second, yet a water color filter takes 27.3 seconds, and the
> BANDWIDTH stopped the second CPU having a crack at it?

WaterColor is probably not bandwidth limited.
I didn't say that EVERY operation is bandwidth limited.
And given that WaterColor is nearly useless (and unused by professionals),
I doubt that it has been optimized at all.

>
> Most 2D and 3D graphics operations are "severely" CPU limited.

No, most 2D graphics are bandwidth limited.
And some (Electric Image comes to mind) 3D graphics programs are bandwidth
limited.


> Raster manipulation of bitmaps in 2D graphics and video effects is
> both CPU bound and a very good candidate for SMP. I am not saying
> Adobe can't write good code. I am saying they haven't bothered.

Sorry, they write good code, and they do bother to optimize all of the
common operations and many of the user-requested operations (those where
they say "we'd like to use it, but it's too slow").

> If you had ever seen Digital Fusion in action you would never accuse
> them of writing poor code.

If your benchmarks are accurate, then I don't need to see it in action -- I
know that it is pretty close to CPU limited.

>
> If Digital Fusion is using 100% of two CPUs because of inefficient
> code, tell me how I get a 199% speedup on a dual system?
>
> Be creative, we need another laugh.

Your creative math (or is it just a complete lack of facts) shown above
already has us rolling on the floor.
If their code wasn't so CPU limited, then you wouldn't get anywhere close
to 2X speed on a 2 processor system. Since even ray-tracing with spectral
colors and advanced geometry has been brought close to bandwidth limits, I
would have to guess that Digital Fusion doesn't have great optimization (or
that your benchmarks are inaccurate).

>
> >Please check your facts before showing your ignorance in a public forum.
>

> Sound, sound advice...

Which I wish you would consider.

Chris

David T. Wang

unread,
Feb 19, 1999, 3:00:00 AM2/19/99
to
Chris Cox (cc...@slip.net) wrote:
: dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
: Wang) wrote:

: > Chris Cox (cc...@slip.net) wrote:
: > : In article <36da9b8c...@10.1.1.9>, dav...@ozramp.net.au.despam-me!
: > : (davexl) wrote:
: >
: > : > I am sick to death of this misinformation.
: > : >
: > : > Photoshop is NOT a good example of Symmetric Multi-Processing. In fact
: > : > hardly any of it IS SMP. See table below.
: > : >
: > : > I ran the infamous Photoshop 5 action file on a dual Celeron 450 in
: > : > one and two CPU configuration. To summarise: SMP is implemented so
: > : > badly that the difference on this test was 7%. Yes, seven whole
: > : > percent. Only five functions in this test were multithreaded. This is
: > : > not a biased example of choosing functions either - notice that the
: > : > few that are SMP are counted multiple times.
: >
: > : Please check your facts before showing your ignorance in a public forum.
: >
: > : The biggest problem is that many of the operations you tested are bandwidth
: > : limited. If you add more processors, they're still limited by how fast the
: > : system can move data to and from DRAM. If you have multiple processors
: > : trying to do something that is limited by DRAM bandwidth, it may actually
: > : go SLOWER than having a single processor working on that task (unless
: > : you've got one heck of a big crossbar for a system bus like certain
: > : workstations and supercomputers).
: >
: > Please point out the operations which are bandwidth limited, then we

: > can check against some of the numbers.

: I can't do that (too much proprietary info) -- but I can tell you that most
: of the common operations in Photoshop are bandwidth limited on current
: desktop systems.

But then we basically have your word against his. No disrespect intended,
but I don't accept conclusions just because they come from an authoritative
source; I'd like to think about it for myself, and be able to check the
data if I could. In this case, there is a big question mark in my head
with respect to your assertion; see below.

: >There are Pentium II's


: > with 66 Mhz and 100 MHz busses, and Macs with 50/66/100 Mhz
: > memory busses. With publically available data, we can see the amount
: > of improvement that increased memory bandwidth has brought.
: > (e.g. there should be little difference between the 350/400/450 MHz
: > Pentium II's, but big one between the 333 and 350 MHz Pentium II.
: > Similar deltas should exist for the Macintosh platform)

: It's not that simple.

Then how do we test for the bandwidth limitation of these Photoshop operations?
Can we compare to STREAM, or use some other way of measuring or estimating
bandwidth?

: > With this data, we can better decide if your assertion of memory-bandwidth


: > saturation or Mr Davexl's assertion of poor threading is the bottleneck
: > for those specific functions in dispute.

: Read the credits in Photoshop 4 or 5.
: Then decide.

Sorry, credentials are nice, but I'd like to see some real numbers.

The reason is simple. Macs have long performed well on Photoshop,
despite always having inferior memory bandwidth. (both theoretical,
as indicated by raw MHz numbers, and measured, as indicated by STREAM)
Thus it is a surprise to hear that indeed, most photoshop operations
are saturated by memory bandwidth. If that is the case, my thinking
is that most PC's should've long obliterated Macs in Photoshop, especially
when PC's long shipped with 66 MHz memory busses, and Macs shipped with
45/50 MHz busses. Yet the reverse is true. Why?

: Chris

: PS. Photoshop's MP code was originally written for NT, not Macintosh.

--

David T. Wang

unread,
Feb 19, 1999, 3:00:00 AM2/19/99
to
Chris Cox (cc...@slip.net) wrote:
: In article <36e1cbd8....@10.1.1.9>, dav...@ozramp.net.au.despam-me!
: (davexl) wrote:

: > NONE of those Photoshop operations are bandwidth limited.

: I have bus traces and benchmarks that prove otherwise.
: As do all the major processor and system manufacturers who use Photoshop as
: a benchmark.

With regard to your previous post, and your contention that you have done
real benchmarking and discovered that "many" common PS operations
are limited by memory bandwidth: I do understand your desire not to
reveal proprietary information, but is there a way you can at least
generalize the behavior of various operations, and explain why PC's
continue to perform poorly relative to Macs, despite having better
memory bandwidth? Something in my head doesn't quite jibe,
and if for no other reason than to erase the question mark, is there
a way you can provide more information on this topic?

I'm always glad to learn something new.

Jerry Kindall

unread,
Feb 19, 1999, 3:00:00 AM2/19/99
to
In article <7aj74r$g5n$2...@hecate.umd.edu>,

dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
Wang) wrote:

> With regards to your previous post, and your contention that you have done
> real benchmarking to have discovered that "many" common PS operations
> are limited by memory bandwidth. I do understand your desire not to
> reveal proprietary information, but is there a way you can at least
> generalize the behavior of various operations, and explain why PC's
> continue to perform poorly relative to Macs, despite having better
> memory bandwidth? Something in my head doesn't quite jive properly,
> and if for no other reason than to erase the questionmark, is there
> a way you can provide more information on this topic?

It _may_ have to do with the Mac's (G3's) cache architecture. The G3 Mac
has a separate bus for its L2 cache, and that bus can run at a
significantly higher speed than the system bus. (For example, some 300
MHz G3 upgrades allow you to run the cache itself at 200 or even 300 MHz,
though Apple's own machines typically run the cache at half the speed of
the processor.) I'm not familiar enough with PC architecture, however, to
know whether the same technique is used in Pentia.

mac...@pobox.com

unread,
Feb 20, 1999, 3:00:00 AM2/20/99
to
Well, the Pentium Pro runs its cache at full processor speed and so
does the Xeon and the Celeron (excepting the 266 and 300 models
already discontinued). The PII runs it at half processor speed. I do
not think the problem lies here. Latency .........? Memory
architecture ........? Floating point and integer .....?

Francisco Garcia Maceda
mac...@pobox.com

Chris Cox

unread,
Feb 20, 1999, 3:00:00 AM2/20/99
to
In article <7aj74r$g5n$2...@hecate.umd.edu>,
dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
Wang) wrote:

> Chris Cox (cc...@slip.net) wrote:
> : In article <36e1cbd8....@10.1.1.9>, dav...@ozramp.net.au.despam-me!
> : (davexl) wrote:
>
> : > NONE of those Photoshop operations are bandwidth limited.
>
> : I have bus traces and benchmarks that prove otherwise.
> : As do all the major processor and system manufacturers who use Photoshop as
> : a benchmark.
>

> With regards to your previous post, and your contention that you have done
> real benchmarking to have discovered that "many" common PS operations
> are limited by memory bandwidth. I do understand your desire not to
> reveal proprietary information, but is there a way you can at least
> generalize the behavior of various operations, and explain why PC's
> continue to perform poorly relative to Macs, despite having better
> memory bandwidth? Something in my head doesn't quite jive properly,
> and if for no other reason than to erase the questionmark, is there
> a way you can provide more information on this topic?

I think I'm going to have to write some new benchmarks to show a lot of this.
Some of it is counter intuitive unless you've spent a lot of time working
on low level optimization or computer bus circuitry.

Chris

Chris Cox

unread,
Feb 20, 1999, 3:00:00 AM2/20/99
to
In article <kindall-1902...@nic-c50-203.mw.mediaone.net>,
kin...@mail.manual.com (Jerry Kindall) wrote:

> In article <7aj74r$g5n$2...@hecate.umd.edu>,
> dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
> Wang) wrote:
>
> > With regards to your previous post, and your contention that you have done
> > real benchmarking to have discovered that "many" common PS operations
> > are limited by memory bandwidth. I do understand your desire not to
> > reveal proprietary information, but is there a way you can at least
> > generalize the behavior of various operations, and explain why PC's
> > continue to perform poorly relative to Macs, despite having better
> > memory bandwidth? Something in my head doesn't quite jive properly,
> > and if for no other reason than to erase the questionmark, is there
> > a way you can provide more information on this topic?
>

> It _may_ have to do with the Mac's (G3's) cache architecture. The G3 Mac
> has a separate bus for its L2 cache, and that bus can run at a
> significantly higher speed than the system bus. (For example, some 300
> MHz G3 upgrades allow you to run the cache itself at 200 or even 300 MHz,
> though Apple's own machines typically run the cache at half the speed of
> the processor.) I'm not familiar enough with PC architecture, however, to
> know whether the same technique is used in Pentia.

Even the older PPC machines still outperformed PC's with theoretically
better busses. It has more to do with the CPU _and_ the motherboard
architecture.
And yes, the same basic L1 and L2 cache techniques are used in the Pentia
as in the PowerPC processors.

Chris

Chris Cox

unread,
Feb 20, 1999, 3:00:00 AM2/20/99
to
In article <7aj61k$g5n$1...@hecate.umd.edu>,

dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
Wang) wrote:

> But then we basically have your word against his. No disrespect intended,
> but I don't accept conclusions just because it came from an authoritative
> source, I'd like to think about it for myself, and be able to check the
> data if I could. In this case, there is a big questionmark in my head
> with respect to your assertion, see below.

Sure, it's my word against his.
I wrote much of the code in question, I benchmarked it, I studied how it
interacts with the system software and hardware, I work with the chip and
board designers, and I have a background in performance optimization at the
lowest level (ie: I start from the silicon and work my way up).
And he has....bad guesswork.
Yeah, I could see how you'd take both opinions equally.


> : >There are Pentium II's
> : > with 66 Mhz and 100 MHz busses, and Macs with 50/66/100 Mhz
> : > memory busses. With publically available data, we can see the amount
> : > of improvement that increased memory bandwidth has brought.
> : > (e.g. there should be little difference between the 350/400/450 MHz
> : > Pentium II's, but big one between the 333 and 350 MHz Pentium II.
> : > Similar deltas should exist for the Macintosh platform)
>
> : It's not that simple.
>
> The how do we test for the bandwidth limitation of these Photoshop operations?
> Can we compare to STREAM, or some other ways of measuring or estimating
> bandwidth?

No, because the operations in STREAM are too simplistic to model all but
the simplest Photoshop operations. (see side note below)

> : > With this data, we can better decide if your assertion of memory-bandwidth
> : > saturation or Mr Davexl's assertion of poor threading is the bottleneck
> : > for those specific functions in dispute.
>
> : Read the credits in Photoshop 4 or 5.
> : Then decide.
>
> Sorry, credentials are nice, but I'd like to see some real numbers.

Sorry, company policy is not to release numbers.

> The reason is simple. Macs have long performed well on Photoshop,
> despite always having inferior memory bandwidth. (both theoretical,
> as indicated by raw MHz numbers, and measured, as indicated by STREAM)

Inferior for SOME things.
That's why I asked about the number of open DRAM pages and the number of
active streams.
Macintosh bandwidth has usually scaled well with increased streams; PC
systems have not. So if I have 2 machines with a peak 200MB/s bandwidth,
the Macintosh will get around 85% of that when reading and writing 7
streams while the PC only gets around 60%. And that's with identical
SIMMs.
A PC might do a memcpy faster... but how about a blend involving 8 input
channels, 4 output channels, 1 mask, 2 lookup tables and the code itself?
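
As a concrete picture of what such a blend can look like in C -- a
hypothetical sketch, not Adobe's code; the channel counts, LUT use and
blend rule are my illustrative assumptions:

#include <stddef.h>

void blend(const unsigned char *in[8], unsigned char *out[4],
           const unsigned char *mask,
           const unsigned char lut_a[256], const unsigned char lut_b[256],
           size_t n)
{
    size_t i;
    int c;
    for (i = 0; i < n; i++) {
        unsigned m = mask[i];                  /* mask: stream 9 */
        for (c = 0; c < 4; c++) {
            unsigned a = lut_a[in[c][i]];      /* inputs 1-4, through LUT a */
            unsigned b = lut_b[in[c + 4][i]];  /* inputs 5-8, through LUT b */
            out[c][i] = (unsigned char)((a * m + b * (255u - m)) / 255u);
        }                                      /* outputs: streams 10-13 */
    }
}

Thirteen data streams plus two lookup tables plus the code itself are
live at once -- exactly the many-stream case where the scaling
difference described above shows up.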

SIDE NOTE: Is STREAM really measuring bandwidth, or the speed of FP ops
plus bandwidth? Also, STREAM can be optimized to nop's, so it's not a
great test overall.


> Thus it is a surprise to hear that indeed, most photoshop operations
> are saturated by memory bandwidth.

Not most -- most COMMON operations.
Do you really think I spend that much time optimizing things that nobody uses?


> If that is the case, my thinking
> is that most PC's should've long obliterated Macs in Photoshop, especially
> when PC's long shipped with 66 MHz memory busses, and Macs shipped with
> 45/50 MHz busses. Yet the reverse is true. Why?

Because raw bandwidth isn't everything.
And because while clock speed is nice, latency and emulation of an ancient
architecture kill performance.

Chris

Thingfishhhh

unread,
Feb 20, 1999, 3:00:00 AM2/20/99
to
In article <ccox-ya02408000R...@news.slip.net>, cc...@slip.net
(Chris Cox) wrote:

> Read the credits in Photoshop 4 or 5.
> Then decide.
>
>

> Chris

"Last minute tweaks", huh?

Damn - I'm impressed with anyone who gets splash screen credit for
Photoshop, but then I am obsessed with it...

Thingfishhhh

unread,
Feb 20, 1999, 3:00:00 AM2/20/99
to
In article <7aj61k$g5n$1...@hecate.umd.edu>,
dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
Wang) wrote:

> But then we basically have your word against his. No disrespect intended,
> but I don't accept conclusions just because it came from an authoritative
> source, I'd like to think about it for myself, and be able to check the
> data if I could. In this case, there is a big questionmark in my head
> with respect to your assertion, see below.

Sheesh! The guy wrote the code! What do you want, the numbers written in
the sky in fire by flying monkeys?

> The reason is simple. Macs have long performed well on Photoshop,
> despite always having inferior memory bandwidth. (both theoretical,
> as indicated by raw MHz numbers, and measured, as indicated by STREAM)

> Thus it is a surprise to hear that indeed, most photoshop operations

> are saturated by memory bandwidth. If that is the case, my thinking


> is that most PC's should've long obliterated Macs in Photoshop, especially
> when PC's long shipped with 66 MHz memory busses, and Macs shipped with
> 45/50 MHz busses. Yet the reverse is true. Why?


It's a big conspiracy. Now you have to die, because you've discovered the truth.

The Rosicrucians are *not* happy.

Roy Anderson

unread,
Feb 20, 1999, 3:00:00 AM2/20/99
to

Thingfishhhh wrote:
>
> In article <7aj61k$g5n$1...@hecate.umd.edu>,
> dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
> Wang) wrote:
>
> > But then we basically have your word against his. No disrespect intended,
> > but I don't accept conclusions just because it came from an authoritative
> > source, I'd like to think about it for myself, and be able to check the
> > data if I could. In this case, there is a big questionmark in my head
> > with respect to your assertion, see below.
>
> Sheesh! The guy wrote the code! What do you want, the numbers written in
> the sky in fire by flying monkeys?
>

I'd pay to see that!

blah

unread,
Feb 20, 1999, 3:00:00 AM2/20/99
to

> It _may_ have to do with the Mac's (G3's) cache architecture. The G3 Mac
> has a separate bus for its L2 cache, and that bus can run at a
> significantly higher speed than the system bus. (For example, some 300
> MHz G3 upgrades allow you to run the cache itself at 200 or even 300 MHz,
> though Apple's own machines typically run the cache at half the speed of
> the processor.) I'm not familiar enough with PC architecture, however, to
> know whether the same technique is used in Pentia.

Cache architecture in Intel systems is similar: a separate bus for the
L2 cache.

Jerry Kindall

unread,
Feb 21, 1999, 3:00:00 AM2/21/99
to
In article <7aqbal$1oq$1...@hecate.umd.edu>,

dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
Wang) wrote:

> ino, I would simple like an explanation. Sometimes, I'd like to learn a
> thing or two. If you simply "believe" based on authority, then you'll
> never learn anything new. Why is the world round? because I say it is?

Don't tell me you have personally seen that the world is round.

Most of us are content to take the word of the guy who saw it and brought
back the pictures.

Trevor Zion Bauknight

unread,
Feb 21, 1999, 3:00:00 AM2/21/99
to
In article <7aqbal$1oq$1...@hecate.umd.edu>,
dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
Wang) wrote:

> : The Rosicrucians are *not* happy.
>
> Send the death squad.

I wouldn't joke about the Knights of the Rosy Cross, my friend...you never
know when you'll be next to one at the red light.

Trev

David T. Wang

unread,
Feb 22, 1999, 3:00:00 AM2/22/99
to
Thingfishhhh (Thingf...@yahoo.com) wrote:
: dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
: Wang) wrote:

: > But then we basically have your word against his. No disrespect intended,


: > but I don't accept conclusions just because it came from an authoritative
: > source, I'd like to think about it for myself, and be able to check the
: > data if I could. In this case, there is a big questionmark in my head
: > with respect to your assertion, see below.

: Sheesh! The guy wrote the code! What do you want, the numbers written in
: the sky in fire by flying monkeys?

No, I would simply like an explanation. Sometimes, I'd like to learn a
thing or two. If you simply "believe" based on authority, then you'll
never learn anything new. Why is the world round? Because I say it is?

There is a reason for most anything, and if you stop for just a second,
you'll see that the explanation given really is counterintuitive, and
if an explanation can be given, we may all learn a thing or two.

: > The reason is simple. Macs have long performed well on Photoshop,


: > despite always having inferior memory bandwidth. (both theoretical,
: > as indicated by raw MHz numbers, and measured, as indicated by STREAM)
: > Thus it is a surprise to hear that indeed, most photoshop operations
: > are saturated by memory bandwidth. If that is the case, my thinking
: > is that most PC's should've long obliterated Macs in Photoshop, especially
: > when PC's long shipped with 66 MHz memory busses, and Macs shipped with
: > 45/50 MHz busses. Yet the reverse is true. Why?

: It's a big conspiracy. Now you have to die, because you've dicovered the truth.

: The Rosicrucians are *not* happy.

Send the death squad.

--

David T. Wang

unread,
Feb 22, 1999, 3:00:00 AM2/22/99
to
Chris Cox (cc...@slip.net) wrote:
: dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
: Wang) wrote:

: > But then we basically have your word against his. No disrespect intended,
: > but I don't accept conclusions just because it came from an authoritative
: > source, I'd like to think about it for myself, and be able to check the
: > data if I could. In this case, there is a big questionmark in my head
: > with respect to your assertion, see below.

: Sure, it's my word against his.


: I wrote much of the code in question, I benchmarked it, I studied how it
: interacts with the system software and hardware, I work with the chip and
: board designers, and I have a background in performance optimization at the
: lowest level (ie: I start from the silicon and work my way up).
: And he has....bad guesswork.
: Yeah, I could see how you'd take both opinions equally.

Sorry, it's the nature of the internet. Anyone can get on a soap box
and shout out his/her opinions, and often pass those opinions off as facts.
I've found it best to challenge based on the information given, not on
credentials alone.

: > : >There are Pentium II's


: > : > with 66 Mhz and 100 MHz busses, and Macs with 50/66/100 Mhz
: > : > memory busses. With publically available data, we can see the amount
: > : > of improvement that increased memory bandwidth has brought.
: > : > (e.g. there should be little difference between the 350/400/450 MHz
: > : > Pentium II's, but big one between the 333 and 350 MHz Pentium II.
: > : > Similar deltas should exist for the Macintosh platform)
: >
: > : It's not that simple.
: >
: > The how do we test for the bandwidth limitation of these Photoshop operations?
: > Can we compare to STREAM, or some other ways of measuring or estimating
: > bandwidth?

: No, because the operations in STREAM are too simplistic to model all but
: the simplest Photoshop operations. (see side note below)

: > : > With this data, we can better decide if your assertion of memory-bandwidth
: > : > saturation or Mr Davexl's assertion of poor threading is the bottleneck
: > : > for those specific functions in dispute.
: >

: > : Read the credits in Photoshop 4 or 5.
: > : Then decide.
: >
: > Sorry, credentials are nice, but I'd like to see some real numbers.

: Sorry, company policy is not to release numbers.

Fair enough. I just wanted to erase the question mark.

: > The reason is simple. Macs have long performed well on Photoshop,
: > despite always having inferior memory bandwidth. (both theoretical,
: > as indicated by raw MHz numbers, and measured, as indicated by STREAM)

: Inferior for SOME things.


: That's why I asked about the number of open DRAM pages and the number of
: active streams.
: Macintoshes bandwidth has usually scaled well with increased streams, PC
: systems have not. So if I have 2 machines with a peak 200MB/s bandwidth,
: the Macintosh will get around 85% of that when reading and writing 7
: streams while the PC only gets around 60%. And that's with identical
: SIMMs.
: A PC might do a memcpy faster... but how about a blend involving 8 input
: channels, 4 output channels, 1 mask, 2 lookup tables and the code itself?

I'm not sure I understand. You now have separate data and code sections
which you must actively access to process the data.

Datain section0, section1, section2 ...
Dataout section0, section1, section2 ...
mask section0
LUT section0, section1.
code section0

Basically, you have to interleave the memory access patterns to properly
evaluate every single data point, is this correct? So what you are
saying is that "PC's" do poorly in sequences like this .. ?

Grab two chunks from Datain section 0, 1 chunk from section 1, 2 chunks from
section 2... 1 chunk from mask .... crunch data (somewhere in the middle)
shove stuff back to Dataout section 0....

If this is correct, then it would appear almost as entirely random access
to the processor, and you're latency bound. (yes/no?)

: SIDE NOTE: Is STREAM really measuring bandwidth, or the speed of FP ops


: plus bandwidth? Also, STREAM can be optimized to nop's, so it's not a
: great test overall.

This has been discussed before, as I took a discussion in CSMA to
comp.benchmarks, and Dr McCalpin himself answered. Basically, on
modern micros, computational latency is entirely irrelevant; his reply
to my post may be found here.

http://x13.dejanews.com/[ST_rn=ps]/getdoc.xp?AN=387549839&CONTEXT=919646870.175374399&hitnum=0

(If the URL doesn't work, use dejanews power search, searching for subject
STREAM, forum comp.benchmarks, author McCalpin, date between September 1, 1998
and September 10, 1998)
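
For context, the heaviest STREAM kernel (Triad) is essentially just the
loop below -- a sketch of the published kernel, not Dr McCalpin's exact
source. The arrays must be much larger than the caches for the result
to measure memory rather than cache:

#define N (2 * 1024 * 1024)     /* 16 MB per array: well past L2 */
static double a[N], b[N], c[N];

void triad(double q)
{
    long i;
    for (i = 0; i < N; i++)
        a[i] = b[i] + q * c[i]; /* 2 reads + 1 write + 1 FP multiply-add */
}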

: > Thus it is a surprise to hear that indeed, most photoshop operations
: > are saturated by memory bandwidth.

: Not most -- most COMMON operations.


: Do you really think I spend that much time optimizing things that nobody uses?

no. I've never even used Photoshop. (except maybe once or twice, to
look at other people's work) My interest is the hardware underneath;
Photoshop interests me because it gets passed off as a "processor
benchmark" so often. If it is "processor intensive", I'd like to know
how it uses the processor.

: > If that is the case, my thinking


: > is that most PC's should've long obliterated Macs in Photoshop, especially
: > when PC's long shipped with 66 MHz memory busses, and Macs shipped with
: > 45/50 MHz busses. Yet the reverse is true. Why?

: Because raw bandwidth isn't everything.


: And because while clock speed is nice, latency and emulation of an ancient
: architecture kill performance.

What kind of features would you like to see on an "x86" memory subsystem?
Latency hits everyone equally badly, but what kind of resources are you,
the programmer, desperate for more of?

: Chris

David T. Wang

unread,
Feb 22, 1999, 3:00:00 AM2/22/99
to
Jerry Kindall (kin...@mail.manual.com) wrote:
: dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
: Wang) wrote:

: > ino, I would simple like an explanation. Sometimes, I'd like to learn a

: > thing or two. If you simply "believe" based on authority, then you'll
: > never learn anything new. Why is the world round? because I say it is?

: Don't tell me you have personally seen that the world is round.

: Most of us are content to take the word of the guy who saw it and brought
: back the pictures.

Ah, but looking at the pictures (data) would be better than just listening
to him/her? It would be best to listen, and to see. (short of flying
up into space and looking back down)

: --

: Jerry Kindall mailto:kin...@mail.manual.com Technical Writing
: Manual Labor http://www.manual.com We wrote the book!

: "During your times of trial and suffering, when you see only one
: set of footprints, it was then that I was riding the pogo stick."

--

Ken Lui

unread,
Feb 22, 1999, 3:00:00 AM2/22/99
to
In article <ccox-ya02408000R...@news.slip.net>,

Chris Cox <cc...@slip.net> wrote:
>Sure, it's my word against his.
>I wrote much of the code in question, I benchmarked it, I studied how it
>interacts with the system software and hardware, I work with the chip and
>board designers, and I have a background in performance optimization at the
>lowest level (ie: I start from the silicon and work my way up).
>And he has....bad guesswork.
>Yeah, I could see how you'd take both opinions equally.

This is classic. But then this is csma, where people who yell the
loudest or post the most have authority and "respect."

>Because raw bandwidth isn't everything.
>And because while clock speed is nice, latency and emulation of an ancient
>architecture kill performance.


Ken
--
Ken Lui 19111 Pruneridge Avenu
kl...@cup.hp.com Cupertino, CA 95014-0795 USA
Information Solutions & Services 1.408.447.3230 FAX 1.408.447.0218
Views within this message may not be those of the Hewlett-Packard Company

Frank DeCarlo

unread,
Feb 22, 1999, 3:00:00 AM2/22/99
to
Trevor Zion Bauknight <tr...@sc.edu> wrote in message news:trev-21029...@col-pm3-145.innova.net...
> In article <7aqbal$1oq$1...@hecate.umd.edu>,

> dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
> Wang) wrote:
>
> > : The Rosicrucians are *not* happy.
> >
> > Send the death squad.
>
> I wouldn't joke about the Knights of the Rosy Cross, my friend...you never
> know when you'll be next to one at the red light.

ROFL Rosy Rage?

Joe Ragosta

unread,
Feb 22, 1999, 3:00:00 AM2/22/99
to
In article <36e20244....@news.oasistech.com>, BoBli...@loa.con wrote:

> Christian Bau <christ...@isltd.insignia.com> wrote:
>
> >So the magic bullet SMP only works if the application writer
> >cooperates???
>
> Not cooperates, tells the processors how to break up the task(s).
>
> >Just like multiprocessing on the Macintosh works if the
> >application writer cooperates???
>
> Are Mac users saying the OS is smart enough to know what all the
> applications
> are trying to do and to break up the tasks to maximize the process?

No. Macs work just the same way.

But Mac users haven't been claiming that the Mac's MP was superior.
Instead, you have the Wintrolls running around yapping about how bad
multiprocessing is on Macs and how NT is _so_ much better.

The above discussion is meant to show that they're wrong.

--
Regards,

Joe Ragosta

Thingfishhhh

unread,
Feb 22, 1999, 3:00:00 AM2/22/99
to
In article <trev-21029...@col-pm3-145.innova.net>, tr...@sc.edu
(Trevor Zion Bauknight) wrote:

> In article <7aqbal$1oq$1...@hecate.umd.edu>,
> dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
> Wang) wrote:
>
> > : The Rosicrucians are *not* happy.
> >
> > Send the death squad.
>
> I wouldn't joke about the Knights of the Rosy Cross, my friend...you never
> know when you'll be next to one at the red light.
>

> Trev

And *NEVER* mention the Spanish Inquisition....

Thingfishhhh

unread,
Feb 22, 1999, 3:00:00 AM2/22/99
to
In article <7aqbal$1oq$1...@hecate.umd.edu>,
dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
Wang) wrote:

> Thingfishhhh (Thingf...@yahoo.com) wrote:
> : dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.


> : Wang) wrote:
>
> : > But then we basically have your word against his. No disrespect intended,
> : > but I don't accept conclusions just because it came from an authoritative
> : > source, I'd like to think about it for myself, and be able to check the
> : > data if I could. In this case, there is a big questionmark in my head
> : > with respect to your assertion, see below.
>

> : Sheesh! The guy wrote the code! What do you want, the numbers written in
> : the sky in fire by flying monkeys?
>

> ino, I would simple like an explanation. Sometimes, I'd like to learn a
> thing or two. If you simply "believe" based on authority, then you'll
> never learn anything new. Why is the world round? because I say it is?

> There is a reason for most anything, and if you stopped for just a second,
> you'll see that the explanation given really is counter intuative, and

> if an explanation may be given, we may all learn a thing or two.

I'm no sheep, sir. And please keep theology out of this NG, you may scare
the Ubergeeks with "intangibles". One cannot benchmark such things.
Answering the great questions of life does not make Quake faster.

And, I was *JOKING*! Sheesh! I was laughing my ASS off! I mean, how much
more hilarious could it get?

"Who do you think you are, the guy who wrote the code?"

"Uh, yeah. I am".

LOL!

Relax. I was teasing you.

I think if you behave and keep playing nice with our new friend, he *will*
give you answers, even if they're not the ones you expect or are happy
with. (Within legal limits, o'course)


> : It's a big conspiracy. Now you have to die, because you've discovered
> : the truth.
>

> : The Rosicrucians are *not* happy.
>
> Send the death squad.

"Biggles! Get me the.....comfy chair!" (obligatory Monty Python reference)

Chris Cox

unread,
Feb 23, 1999, 3:00:00 AM2/23/99
to
In article <7aqdra$1oq$2...@hecate.umd.edu>,

More correctly: Intel based systems do poorly on things like this.

>
> Grab two chunk from Datain section 0, 1 chunk from section 1, 2 chunks from
> section 2... 1 chunk from mask .... crunch data (somewhere in the middle)
> shove stuff back to Dataout section 0....
>
> If this is correct, then it would appear almost as entirely random access
> to the processor, and you're latency bound. (yes/no?)

Yes. But in different situations you can be bound by the write buffers, L1
latency, L2 latency, L2 bandwidth, DRAM latency, the DRAM plus memory
controller latency, bus bandwidth, etc.
But yeah, I'd say that to most processors Photoshop's memory access looks
kinda random.


> : > Thus it is a surprise to hear that indeed, most photoshop operations
> : > are saturated by memory bandwidth.
>
> : Not most -- most COMMON operations.
> : Do you really think I spend that much time optimizing things that
> : nobody uses?
>
> no. I've never even use Photoshop. ( except maybe once or twice. used it to
> look at other people's work) My interest is the hardware underneath.
> Photoshop because it gets passed off as a "processor benchmark" so often.
> If it is "processor intensive", I'd like to know how it uses the processor.

Some parts of it are very computation bound, but the most common parts are
bandwidth bound. CPU speed is still outpacing bandwidth, so it's likely to
get even more bandwidth bound as time goes on.


> : > If that is the case, my thinking
> : > is that most PC's should've long obliterated Macs in Photoshop, especially
> : > when PC's long shipped with 66 MHz memory busses, and Macs shipped with
> : > 45/50 MHz busses. Yet the reverse is true. Why?
>
> : Because raw bandwidth isn't everything.
> : And because while clock speed is nice, latency and emulation of an ancient
> : architecture kill performance.
>
> What kind of features would you like to see on an "x86" memory subsystem?
> Latency hits everyone equally badly, but what kind of resources are you,
> the programmer, desperate for more of?

The ones I've been describing to the chipset and CPU manufacturers :-).
It's hard to pin down specifics (especially here where I have no idea of
the reader's background) -- but mostly I'm just trying to get larger caches,
higher bandwidth and lower latencies.

BTW - latency does not hit everyone equally badly.
If you're doing a simple operation on a few streams of data, the CPU or
your own cache hints (prefetches, whatever) can hide the latency very well.
It's only when you have 'random' access patterns that latency comes into
play, or when you have enough streams that the caches thrash and the access
starts to look random.
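
A sketch of the two extremes being contrasted (my construction): a
sequential scan whose addresses are predictable, so prefetching can hide
DRAM latency, versus a dependent pointer chase where every load address
comes from the previous load and latency is fully exposed. Assume next[]
holds a random permutation forming a single cycle:

#define N (1 << 20)

/* prefetch-friendly: addresses are predictable, latency can be hidden */
long scan(const long *v)
{
    long i, sum = 0;
    for (i = 0; i < N; i++)
        sum += v[i];
    return sum;
}

/* latency-bound: each load depends on the one before it */
long chase(const long *next)
{
    long i = 0, steps;
    for (steps = 0; steps < N; steps++)
        i = next[i];    /* address comes from the previous load */
    return i;
}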

I think I've now got a benchmark that'll show some of these complex
interactions (in a non-proprietary way :-). Now I've got to do some
serious testing of it and see if I can actually explain the results on
different systems -- because the raw numbers are FAR from intuitive. If my
initial tests are any indication, I might even be able to extract some
interesting system statistics from it.

Chris

David T. Wang

unread,
Feb 24, 1999, 3:00:00 AM2/24/99
to
Chris Cox (cc...@slip.net) wrote:
: dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
: Wang) wrote:

: > I'm not sure I understand. You now have separate data and code sections


: > which you must actively access to process the data.
: >
: > Datain section0, section1, section2 ...
: > Dataout section0, section1, section2 ...
: > mask section0
: > LUT section0, section1.
: > code section0
: >
: > Basically, you have to interleave the memory access patterns to properly
: > evaluate every single data point, is this correct? So what you are
: > saying is that "PC's" do poorly in sequences like this .. ?

: More correctly: Intel based systems do poorly on things like this.

AMD's systems are better for this?

Techweb had a story that showed an AMD K-6 system which was only 5% slower in
Photoshop than a more expensive PPC based Mac. If this is the dominant speed
limitation, and AMD's memory subsystem is better than Intel's P II based
subsystem, this would be very interesting indeed.

: > Grab two chunk from Datain section 0, 1 chunk from section 1, 2 chunks from

: > section 2... 1 chunk from mask .... crunch data (somewhere in the middle)
: > shove stuff back to Dataout section 0....
: >
: > If this is correct, then it would appear almost as entirely random access
: > to the processor, and you're latency bound. (yes/no?)

: Yes. But in different situations you can be bound by the write buffers, L1


: latency, L2 latency, L2 bandwidth, DRAM latency, the DRAM plus memory
: controller latency, bus bandwidth, etc.
: But yeah, I'd say that to most processors Photoshop's memory access looks
: kinda random.

: > no. I've never even use Photoshop. ( except maybe once or twice. used it to

: > look at other people's work) My interest is the hardware underneath.
: > Photoshop because it gets passed off as a "processor benchmark" so often.
: > If it is "processor intensive", I'd like to know how it uses the processor.

: Some parts of it are very computation bound, but the most common parts are


: bandwidth bound. CPU speed is still outpacing bandwidth, so it's likely to
: get even more bandwidth bound as time goes on.

I am guessing that the "bandwidth bound" part may be just a side effect of cache
thrashing the L1. Still, I am surprised that it would make it all the way
out to main memory. Maybe it's thrashing L2 also.

: > : Because raw bandwidth isn't everything.


: > : And because while clock speed is nice, latency and emulation of an ancient
: > : architecture kill performance.
: >
: > What kind of features would you like to see on an "x86" memory subsystem?
: > Latency hits everyone equally badly, but what kind of resources are you,
: > the programmer, desperate for more of?

: The one's I've been describing to the chipset and CPU manufacturers :-).


: It's hard to pin down specifics (especially here where I have no idea of
: the reader's background) -- but most I'm just trying to get larger caches,
: higher bandwidth and lower latencies.

I have a bit of hardware VLSI background. It's okay to talk about cache
associativity, latency and BW numbers with me.

: BTW - latency does not hit everyone equally badly.


: If you're doing a simple operation on a few streams of data, the CPU or
: your own cache hints (prefetches, whatever) can hide the latency very well.
: It's only when you have 'random' access patterns that latency comes into
: play, or when you have enough streams that the caches thrash and the access
: starts to look random.

Cache hints are a "good thing" (tm), and I think you'll get them on every
architecture soon, but I don't think that will alleviate what you are seeing.

For the most part, I think your memory access patterns look totally random
(from a localized point of view). However, I can see how you may be helped
by a large and highly associative L1. I'm thinking that each "chunk" of
your data stream is probably only 8, 16 or even 32 bits wide, but it would
cost you the time period for a cache line fetch, and in a CPU whose L1
associativity and size are small, the cache line may be replaced before the
next iteration of your loop comes around (especially if you have stream_count
>> associativity). This means that the L1 cache would start to thrash a lot
faster than in another CPU with a larger and more associative L1 cache, and
you appear to be memory bound, because you end up fetching the same data
multiple times.
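
A toy version of that aliasing (my sketch; it assumes something like a
16 KB 4-way L1 with 4 KB per way -- substitute the real numbers for a
given CPU). Every stream starts at a power-of-two stride, so for a fixed
index i all the v[s][i] map to the same cache set; with more streams
than ways, each access evicts another stream's freshly loaded line:

#include <stdlib.h>

#define STREAMS 8
#define STRIDE  (64 * 1024)  /* power of two, a multiple of cachesize/ways */

int main(void)
{
    unsigned char *base = malloc((size_t)STREAMS * STRIDE);
    unsigned char *v[STREAMS];
    unsigned long sum = 0;
    long i;
    int s;

    for (s = 0; s < STREAMS; s++)
        v[s] = base + (size_t)s * STRIDE;  /* all streams share set indices */

    for (i = 0; i < STRIDE; i++)
        for (s = 0; s < STREAMS; s++)
            sum += v[s][i];    /* 8 lines fight over 4 ways: constant misses */

    free(base);
    return (int)sum;
}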

Obviously I'm reaching here with a 10 foot pole, guessing at the exact
cause of the penalty your code streams hit without even looking at them,
but I've run it through my mind a few times, and this scenario makes more
sense to me than other things. You don't appear to be register starved or
FPU stack bound, but main memory bound; on an Intel box with more bandwidth
to main memory, you seem to be starving for more, and ironically, on systems
based on AMD processors, which are slower for many other things, Photoshop
seems to run well.

I think I can set up a few tests for this scenario.
The P54-200 to P55-200 comparison should be interesting, as the basic
architecture didn't change much, and it doesn't seem that your code would
benefit from an improved branch predictor; as long as I stay away from the
"MMX enhanced" but bandwidth limited applications, the effect should be
dramatic as the number of streams being mixed increases.

: I think I've now got a benchmark that'll show some of these complex


: interactions (in a non-proprietary way :-). Now I've got to do some
: serious testing of it and see if I can actually explain the results on
: different systems -- because the raw numbers are FAR from intuitive. If my
: initial tests are any indication, I might even be able to extract some
: interesting system statistics from it.

Please share it if you could. We can even take the discussion to email if
you like. I'll even sign an NDA to look more into the details and exact
nature of the problem if needed.

PHOTOGA

unread,
Feb 25, 1999, 3:00:00 AM2/25/99
to
BULL

INFINiTiML

unread,
Feb 25, 1999, 3:00:00 AM2/25/99
to
The BX motherboard runs at 100 MHz, and the Macs have such a speed
difference as well, but motherboard speeds are very delicate, and
motherboard design is harder because poorly designed high speed boards
suffer from crosstalk, where a signal bleeds from one printed track onto
another. Anyway, the entire system determines speed, usually limited by
the slowest part, but not always. Mine for instance, with 96 megs of
100 MHz RAM, would be faster than 96 megs of 66 MHz RAM; however, someone
would probably outrun me in Photoshop with 128 megs of EDO.

>There are only two things that govern speed, how quickly the hardware gets
>data
>into and out of the processor works and the OS overhead to get an App in and
>out of
>the processor. Intel machines make such an issue of hundreds of MHz, now up
>to 450MHz
>for consumers, and still sitting on a 75MHz buss. (Which might go to 100MHz
>with the
>coming PIII.) And the L2 cache and other look-ahead techniques are as
>faut-filled as
>multi-processing.
>
> I have no idea what is inside Macs these days but if there is a similar
>processor/buss mismatch there is the same problem.

Chris Cox

unread,
Feb 27, 1999, 3:00:00 AM2/27/99
to
In article <7b0b1g$v0b$3...@hecate.umd.edu>,

dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
Wang) wrote:

> Chris Cox (cc...@slip.net) wrote:
> : dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
> : Wang) wrote:
>
> : > I'm not sure I understand. You now have separate data and code sections
> : > which you must actively access to process the data.
> : >
> : > Datain section0, section1, section2 ...
> : > Dataout section0, section1, section2 ...
> : > mask section0
> : > LUT section0, section1.
> : > code section0
> : >
> : > Basically, you have to interleave the memory access patterns to properly
> : > evaluate every single data point, is this correct? So what you are
> : > saying is that "PC's" do poorly in sequences like this .. ?
>
> : More correctly: Intel based systems do poorly on things like this.
>
> AMD's system's are better for this?

I don't know -- I haven't tested them.

> : BTW - latency does not hit everyone equally badly.
> : If you're doing a simple operation on a few streams of data, the CPU or
> : your own cache hints (prefetches, whatever) can hide the latency very well.
> : It's only when you have 'random' access patterns that latency comes into
> : play, or when you have enough streams that the caches thrash and the access
> : starts to look random.
>
> Cache hints are a "good thing" (tm), and I think you'll get that on every
> architecture soon, but that won't alleviate what you are seeing I think.

It won't cure all of it, but they certainly help.


>
> For the most part, I think your memory access patterns look totally random
> (from a localized point of view) However, I can see how you may be helped
> by a large and highly associative L1. I'm thinking that each "chunk" of
> your data stream is probably only 8, 16 or even 32 bit wide, but it would
> cost you the time period for a cache line fetch, and in a CPU with L1 cache
> associativity and size is small, the cacheline may be replaced before the
> next iteration of your loop comes around (especially if you have stream_count
> >> associativity) This means that the L1 cache would start to thrash a lot
> faster than another CPU with a larger and more associative L1 cache, and you
> appear to be memory bound, because you end up fetching the same data
> multiple times.

That is the case on some CPUs.
I've got a few simulations where every memory access causes an L1 miss --
inside a very simple looking blend loop.

>
> Obviously I'm reaching here with a 10 foot pole, guessing as to the exact
> cause of penalty your code streams, even without looking at them, but I've ran
> it through my mind a few times, but this senario would make more sense
> to me than other things. As you don't appear to be register starved or
> FPU stack bound, but main memory bound, and on an Intel box with more
> bandwidth
> to main memory, you seem to starving for more, and ironically, on systems
> based with AMD processors, with are slower for many other things, Photoshop
> seems to run well.

Think about the DRAM controller, the number of banks on the SIMMs, the bus
controller, etc.


>
> I think I can setup a few tests for this senario.
> The P54-200 to P55-200 comparison should be interesting, as the basic
> architecture
> didn't change much, and it doesn't seem that your code would benefit from an
> improved branch predictor, but as long as I stay away from the "MMX enhanced",
> but bandwidth limited applications, this should be dramatic as the number of
> streams being mixed increases.
>
> : I think I've now got a benchmark that'll show some of these complex
> : interactions (in a non-proprietary way :-). Now I've got to do some
> : serious testing of it and see if I can actually explain the results on
> : different systems -- because the raw numbers are FAR from intuitive. If my
> : initial tests are any indication, I might even be able to extract some
> : interesting system statistics from it.
>
> Please share it if you could. We can even take the discussion to email if
> you like. I'll even sign an NDA to look more into the details and exact
> nature of the problem if needed.

I'm still working on it -- in between other things.
Right now all my time is spent trying to get some code ready for a, um, new
processor.

Chris

Rick Tan

unread,
Feb 28, 1999, 3:00:00 AM2/28/99
to
In article <ccox-ya02408000R...@news.slip.net>, cc...@slip.net
(Chris Cox) wrote:

You mean you're trying to get Photoshop to run on Palm Pilots?
If you can pull that off, you're king. ;)

David T. Wang

unread,
Mar 1, 1999, 3:00:00 AM3/1/99
to
Chris Cox (cc...@slip.net) wrote:
: dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
: Wang) wrote:

I was setting up a simple loop mentally, where you can have 5 streams all marching
across memory at the same pace (stride), and they can easily cause the caching
subsystem to thrash. To optimize for this sort of thing, I think you have to
go in and try to make sure they don't all march with the same stride, or
somehow pad the data so the dataset addresses don't align and cause the cache
to thrash before it has to. Obviously it all breaks when your stream count gets
high, but I think that if I had to do it, I'd have to code in assembly (unfortunately)
to keep as much of the data in L1 as I can.
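
The padding trick, sketched (my example; the 32-byte line size is an
assumption): giving each stream one extra cache line of slack skews its
starting address, so for the same element index the streams land in
different cache sets instead of all colliding:

#include <stddef.h>

#define STREAMS 8
#define LEN     (64 * 1024)
#define LINE    32             /* assumed cache line size */

static unsigned char pool[STREAMS * (LEN + LINE)];

/* stream s starts s extra cache lines into the pool, so the streams no
   longer share set indices for the same element index */
static unsigned char *stream(int s)
{
    return pool + (size_t)s * (LEN + LINE);
}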

: > Obviously I'm reaching here with a 10 foot pole, guessing as to the exact


: > cause of penalty your code streams, even without looking at them, but I've ran
: > it through my mind a few times, but this senario would make more sense
: > to me than other things. As you don't appear to be register starved or
: > FPU stack bound, but main memory bound, and on an Intel box with more
: > bandwidth
: > to main memory, you seem to starving for more, and ironically, on systems
: > based with AMD processors, with are slower for many other things, Photoshop
: > seems to run well.

: Think about the DRAM controller, the number of banks on the SIMMs, the bus
: controller, etc.

I think that once you break the architectural assumptions designed into the
organization of the cache, forcing you out to memory constantly, then the
things you are seeing are just side effects, not the cause.

: > I think I can setup a few tests for this senario.

: > The P54-200 to P55-200 comparison should be interesting, as the basic
: > architecture
: > didn't change much, and it doesn't seem that your code would benefit from an
: > improved branch predictor, but as long as I stay away from the "MMX enhanced",
: > but bandwidth limited applications, this should be dramatic as the number of
: > streams being mixed increases.
: >
: > : I think I've now got a benchmark that'll show some of these complex
: > : interactions (in a non-proprietary way :-). Now I've got to do some
: > : serious testing of it and see if I can actually explain the results on
: > : different systems -- because the raw numbers are FAR from intuitive. If my
: > : initial tests are any indication, I might even be able to extract some
: > : interesting system statistics from it.
: >
: > Please share it if you could. We can even take the discussion to email if
: > you like. I'll even sign an NDA to look more into the details and exact
: > nature of the problem if needed.

: I'm still working on it -- in between other things.
: Right now all my time is spent trying to get some code ready for a, um, new
: processor.

Good luck. It'll be interesting to see this kind of code, especially when it
first starts to cause the cache to thrash.

davexl

unread,
Mar 1, 1999, 3:00:00 AM3/1/99
to

ATTN Chris Cox:

Having started this thread, I am now on deadline for a big job, and
without the time to follow this through.

Firstly, until I do get the time, I would like to apologise for
flaming you. I, as you may have gathered, am self-taught in these
matters, and didn't treat you with the respect you deserve. It
shouldn't have mattered whether you were a newbie or indeed, *ahem*, the
author of the code in question: calling you clueless was out of line.

In short, sorry.

Back to the issue that sparked this...

I still have problems with both Photoshop's performance on dual NT
systems and your arguments. (your work as distinct from you!)

If I am way off base or just plain wrong, I would be delighted to hear
it. I would rather be shown to be wrong and learn something. The only
reason I was so strident in assuming you were incorrect was that your
claims bear little relation to what I have seen and tested.

You said I needed education. A fair point, considering who I am
talking to. I would be grateful if you would show me what I have got
wrong.

Common-sense / layman's questions that don't make sense if what you
say is true:

* If most PSD operations are bandwidth limited, why does
performance scale with CPU speed? From all the testing I have done,
clock speed seems to be the major factor.

* If something is bandwidth limited, it should not scale with
CPU speed, correct?

* If that task does scale with CPU speed, and if it is a task
that can be done in parallel, then more often than not it should be
able to scale very well on a dual processor system. Pixels are
something ripe for SMP processing. (unless you work on 1 pixel images
:-)

* Your explanation for Digital Fusion getting 100% scaling on
virtually everything was that they have poor code in great need of
optimisation, hence it being so CPU bound as to saturate any number of
CPUs. I find that hard to accept: my experience using
many graphics apps has taught me that it is obviously easier to write
bad code than good code, and Digital Fusion does not strike me
as bad code. If it is poorly optimised, what does that say for Adobe
After Effects, which in every comparison I have seen is soundly belted
by Digital Fusion for speed? If DF has been carefully written from the
ground up to such an advanced standard, I highly doubt it would have
poor optimisation.

I can now imagine that certain functions would be bandwidth limited,
but I do not accept the argument that in most cases, Photoshop is
tuned so well as to make SMP useless due to a bandwidth bottleneck.

* Where a function in Photoshop is well threaded, it never seems
to rise above about 80% scaling. That is, while 1.8 out of 2 is not
something to complain about, I could be forgiven for thinking that
Photoshop's code is a bit crufty compared to other apps that (when the
going is good) can achieve 1.98+ for 2 CPUs. Why can Photoshop never
seem to saturate 2 CPUs? Surely some function is both CPU bound and
threaded.


* I am particularly startled by your claim of Electric Image
being bandwidth limited. My experience in 3D applications is that they
are the programs that benefit the most from multiple processors - I
would describe the calculations they do as totally CPU bound.
Raytracing with heavy geometry at a meaningful resolution on any
application on any platform is hardly a real time game.

In what way is Electric Image bandwidth limited?


In summary, my original post was to the effect that Photoshop is a
very poor example of SMP. This is not a matter for debate; the figures
speak for themselves. The issue has become WHY. Is it due to the laws
of nature, as it were, in that there are solid obstacles to Adobe
making the code run better in parallel (hardware limitations and so
on)? Or is it because they have not tried / don't know how / can't make it work?

I still contend that it is because Adobe has not seriously tried to.
This is not a capital crime, as I still prefer Photoshop over any
alternative.

You have admitted that certain functions listed in the test are seldom
used and thus not worth the effort. (Very reasonable.) But I would
submit RGB to CMYK conversion as a function that certainly
qualifies as "Pro".

Obviously you are a victim of the set of functions they might choose to
put in the test. But I do not see much difference in other tests.

Now I contrast this with Digital Fusion again. Obviously they have
taken a bottom up approach that means that rather than hand tuning
each function, the default is SMP. Obviously an easier thing to
achieve in a 3 (?) year old code base starting from scratch on NT than
in an app like Photoshop (10+ years?).

But I cannot excuse Adobe which has had the same amount of TIME and
infinitely more resources than a small company like Eyeon. Adobe
should be ahead of the competition, not falling behind.

In short, 99% of the tasks I see run in graphics are CPU bound
enough to scale with clock speed. I cannot tell whether they are only 1%
away from being bandwidth limited, but the limiting factor is the CPU.

I look forward to your reply.


PS, you suggested my benchmarks of Fusion may be suspect.

I test these apps not by how high they peg the CPU meter (although
that is a great guide over time) but by rebooting and running the same
task with one CPU disabled in NT. Each test takes a minimum of ten
minutes.
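
(For reference: on NT 4 the usual way to disable a CPU for a test run
is the /NUMPROC switch in boot.ini, which caps how many processors the
kernel will start. A hedged example; the ARC paths are illustrative:)

[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="NT 4.0 (2 CPUs)"
multi(0)disk(0)rdisk(0)partition(1)\WINNT="NT 4.0 (1 CPU)" /NUMPROC=1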

That is how I recorded 'PB's of 1.99x for Fusion and 1.98x for 3D
Studio MAX. The conditions were optimal (minimal I/O), but still
real-world files.

Chris Cox

unread,
Mar 1, 1999, 3:00:00 AM3/1/99
to
In article <36f95261....@10.1.1.9>, dav...@ozramp.net.au.despam-me!
(davexl) wrote:

> ATTN Chris Cox:
>
> Having started this thread, I am now on deadline for a big job, and
> without the time to follow this through.
>
> Firstly, until I do get the time, I would like to apologise for
> flaming you. As you may have gathered, I am self-taught in these
> matters, and didn't treat you with the respect you deserve. It
> shouldn't have mattered whether you were a newbie or, indeed, *ahem* the
> author of the code in question: calling you clueless was out of line.
>
> In short, sorry.

Apology accepted.


>
> Back to the issue that sparked this...
>
> I still have problems with both Photoshop's performance on dual NT
> systems and your arguments. (your work as distinct from you!)
>
> If I am way off base or just plain wrong, I would be delighted to hear
> it. I would rather be shown to be wrong and learn something. The only
> reason I was so strident in assuming you were incorrect was that your
> claims bear little relation to what I have seen and tested.
>
> You said I needed education. A fair point, considering who I am
> talking to. I would be grateful if you would show me what I have got
> wrong.
>
> Common-sense / layman's questions that don't make sense if what you
> say is true:

Computer systems are not something that a layman is ever going to make sense of.
There are more variables between your RAM and your CPU than most people imagine.
Then throw in interrupts and multiprocessing and you've got something that
should be treated more as a stochastic process than a purely deterministic
one. (But some of us beat ourselves up and try to simulate or trace the
whole complicated mess.)

> * If most PSD operations are bandwidth limited, why does
> performance scale with CPU speed? From all the testing I have done,
> clock speed seems to be the major factor.

You have to measure clock speed AND the bus speed. Since bus speed is
normally a divisor of the clock speed they're hard to separate. And L2
speed is also tied to the CPU speed in most systems now.

And again, it's not most operations, it's the most common operations. I
don't spend a whole lot of time worrying about a function that nobody likes
or uses.

> * If something is bandwidth limited, it should not scale with
> CPU speed, correct?

Except that bus speed and L2 are also tied to CPU clock speed.
The next question is WHERE is the function bandwidth limited?
In the load/store unit of the CPU?
In the L2 cache?
In the DRAM controller?
In the DRAM devices?
In other words, what does the CPU end up waiting for?
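
One hedged way to answer that question empirically is a dependent
pointer chase: every load depends on the previous one, so the time per
hop is the round-trip latency of whatever level of the hierarchy the
ring lands in. Shrink N until the ring fits in L2, then L1, and watch
the nanoseconds per hop fall. A sketch in C, with all sizes assumed:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 22)              /* 4M slots; shrink to probe L2/L1 */

int main(void)
{
    void **ring = malloc(N * sizeof *ring);

    /* Link the slots into one cycle with a large odd stride so each
       hop lands far from the last one (odd stride + power-of-two N
       guarantees a single full cycle).                               */
    size_t stride = 9973, a = 0, b = stride % N;
    for (size_t i = 0; i < N; i++) {
        ring[a] = &ring[b];
        a = b;
        b = (b + stride) % N;
    }

    clock_t t0 = clock();
    void **p = ring;
    for (long i = 0; i < 8L * N; i++)   /* serially dependent loads  */
        p = *p;
    double ns = (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9;

    printf("%.1f ns per load (end=%p)\n", ns / (8.0 * N), (void *)p);
    free(ring);
    return 0;
}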


> * If that task does scale with CPU speed, and if it is a task
> that can be done in parallel, then more often than not it should be
> able to scale very well on a dual processor system. Pixels are
> something ripe for SMP processing. (unless you work on 1 pixel images
> :-)

But there are other factors at play than just the CPU speed and bus speed.
Unless something is completely CPU limited, it won't scale well with
additional processors. If the function is CPU limited by only a small
margin, then it will speed up with additional processors and THEN become
bandwidth limited and not improve further.
Also, there are some algorithms that just don't break down for SMP (or SIMD
:-( ) and are faster than their counterparts that would break down for SMP
(or SIMD). Thank goodness I don't have many of those.
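
That "speeds up, THEN hits the wall" behaviour is easy to put numbers
on. In the toy model below (the two times are invented, not Adobe's
measurements), compute time splits across CPUs while the memory time
does not, and the two are assumed to overlap:

#include <stdio.h>

int main(void)
{
    double t_cpu = 1.00;    /* compute seconds on one CPU (assumed)   */
    double t_mem = 0.80;    /* memory-transfer seconds (assumed)      */
    double base  = t_cpu > t_mem ? t_cpu : t_mem;    /* 1-CPU time    */

    for (int n = 1; n <= 4; n++) {
        double tc = t_cpu / n;                       /* compute splits */
        double t  = tc > t_mem ? tc : t_mem;         /* memory doesn't */
        printf("%d CPU(s): %.2f s, speedup %.2fx\n", n, t, base / t);
    }
    return 0;
}

With those numbers a second CPU buys 1.25x and a third buys nothing:
the function has crossed from CPU limited to bandwidth limited.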


> * Your explanation for Digital Fusion getting 100% scaling on
> virtually everything was that they have poor code in great need of
> optimisation, hence it being so CPU bound as to saturate any number of
> CPUs. I find that hard to accept: my experience using
> many graphics apps has taught me that it is obviously easier to write
> bad code than good code, and Digital Fusion does not strike me
> as bad code. If it is poorly optimised, what does that say for Adobe
> After Effects, which in every comparison I have seen is soundly belted
> by Digital Fusion for speed?

True, After Effects could use a little more optimization work.
However, there's a good reason that professionals prefer the results from
After Effects. (I can't say more until I spend more time with Digital
Fusion)


> If DF has been carefully written from the
> ground up to such an advanced standard, I highly doubt it would have
> poor optimisation.
>
> I can now imagine that certain functions would be bandwidth limited,
> but I do not accept the argument that in most cases, Photoshop is
> tuned so well as to make SMP useless due to a bandwidth bottleneck.

Sorry, but I guess I optimize a little too well.


> * Where a function in Photoshop is well threaded, it never seems
> to rise above about 80% scaling. That is, while 1.8 out of 2 is not
> something to complain about, I could be forgiven for thinking that
> Photoshop's code is a bit crufty compared to other apps that (when the
> going is good) can achieve 1.98+ for 2 CPUs. Why can Photoshop never
> seem to saturate 2 CPUs? Surely some function is both CPU bound and
> threaded.

Part of that is due to Photoshop's VM system -- the fact that it can edit
images much larger than available RAM. It does impose some restrictions on
how we can split up data among the processors.
There are some other problems there -- but we're trying to work with the OS
vendor to resolve them :-).


> * I am particularly startled by your claim of Electric Image
> being bandwidth limited. My experience in 3D applications is that they
> are the programs that benefit the most from multiple processors - I
> would describe the calculations they do as totally CPU bound.
> Raytracing with heavy geometry at a meaningful resolution on any
> application on any platform is hardly a real time game.
>
> In what way is Electric Image bandwidth limited?

In almost every way except for a few special effect textures.
And EI is bandwidth limited partly because they're a scanline renderer, not
a raytracer. Ray tracing with most common algorithms is probably going to
be combination bound -- partly bandwidth (traversing tree structures and
loading object data) and partly CPU (lighting).

> In summary, my original post was to the effect that Photoshop is a
> very poor example of SMP. This is not a matter for debate; the figures
> speak for themselves. The issue has become WHY. Is it due to the laws
> of nature, as it were, in that there are solid obstacles to Adobe
> making the code run better in parallel (hardware limitations and so
> on)? Or is it because they have not tried / don't know how / can't make it work?


It's partly because of nature, partly because of the already
highly-optimized code, partly because of the chipsets, partly because of
the OS, and partly because we might get 10% more out of it but it's
probably not worth the effort.


>
> I still contend that it is because Adobe has not seriously tried to.

And I have several programmers who would disagree.


> This is not a capital crime, as I still prefer Photoshop over any
> alternative.
>
> You have admitted that certain functions listed in the test are seldom
> used and thus not worth the effort. (Very reasonable.) But I would
> submit RGB to CMYK conversion as a function that certainly
> qualifies as "Pro".

That one is probably L2 cache limited. A bigger L2 or a 1:1 L2 will
probably improve it greatly. MP will help it, but it could thrash and slow
down on some systems.
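
A hedged illustration of why a conversion like that can live or die on
the L2 (this is a generic LUT-based separation, not Adobe's code): even
a modest 33x33x33 table of 4-byte nodes is about 140 KB, far beyond a
16 KB L1 but comfortable in a healthy L2, and arbitrary pixel colours
scatter reads across the whole table.

#include <stdint.h>
#include <stdio.h>

#define GRID 33

typedef struct { uint8_t c, m, y, k; } CMYK;

/* ~140 KB of lookup nodes: bigger than L1, smaller than most L2s. */
static CMYK lut[GRID][GRID][GRID];

static CMYK rgb_to_cmyk(uint8_t r, uint8_t g, uint8_t b)
{
    /* Nearest-node lookup for brevity; real converters interpolate
       between neighbouring nodes and touch even more of the table. */
    return lut[r * (GRID - 1) / 255]
              [g * (GRID - 1) / 255]
              [b * (GRID - 1) / 255];
}

int main(void)
{
    CMYK out = rgb_to_cmyk(200, 120, 40);
    printf("%u %u %u %u\n", out.c, out.m, out.y, out.k);
    return 0;
}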


> Obviously you are a victim of the set of functions they might choose to
> put in the test. But I do not see much difference in other tests.

I do -- but then I test the entire application.


> Now I contrast this with Digital Fusion again. Obviously they have
> taken a bottom up approach that means that rather than hand tuning
> each function, the default is SMP. Obviously an easier thing to
> achieve in a 3 (?) year old code base starting from scratch on NT than
> in an app like Photoshop (10+ years?).

True, but threading the whole application can actually lead to problems
with SMP (and MMX, and Katmai, and ...) that threading individual functions
avoids. And there are some system services that just do NOT like too many
different threads calling them.
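
A minimal sketch (mine, not Adobe's) of the thread-the-individual-
function pattern: the filter splits its rows between one worker and the
calling thread and joins before returning, so the rest of the
application, and any thread-hostile system service, only ever sees a
single thread. POSIX threads are used for brevity; on NT the
equivalents would be CreateThread and WaitForSingleObject.

#include <pthread.h>

typedef struct { unsigned char *pixels; int width, row0, row1; } Slice;

static void *invert_rows(void *arg)          /* stand-in "filter"     */
{
    Slice *s = arg;
    for (int y = s->row0; y < s->row1; y++)
        for (int x = 0; x < s->width; x++)
            s->pixels[y * s->width + x] ^= 0xFF;
    return NULL;
}

void invert_image(unsigned char *pixels, int width, int height)
{
    pthread_t worker;
    Slice top = { pixels, width, 0,          height / 2 };
    Slice bot = { pixels, width, height / 2, height     };

    pthread_create(&worker, NULL, invert_rows, &top);
    invert_rows(&bot);                       /* caller takes one half */
    pthread_join(worker, NULL);
}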


>
> But I cannot excuse Adobe which has had the same amount of TIME and
> infinitely more resources than a small company like Eyeon. Adobe
> should be ahead of the competition, not falling behind.

I still think we are ahead, in most areas.
True, we could put some more effort into some of the applications -- but
most of the limiting factors are things that are not going to be easily
understood. (and good optimizers are hard to find -- I keep trying to hire
more)


> In short, 99% of the tasks I see run in graphics are CPU bound
> enough to scale with clock speed. I cannot tell whether they are only 1%
> away from being bandwidth limited, but the limiting factor is the CPU.

I really wish someone would write a good book on real-world computer
architecture... Hennessy and Patterson (or Patterson and Hennessy) is just
too theoretical to recommend -- and I can't find anything else that
mentions the fact that the path between the CPU and RAM isn't just an
infinite-bandwidth pipe. "Inner Loops" comes close, but doesn't really
give great examples and dives immediately into assembly (where he probably
shouldn't).

Have you looked at the STREAM benchmarks?
See http://reality.sgi.com/mccalpin/
http://www.cs.virginia.edu/stream/ref.html#why gives some of the background
(hmm, and some highly questionable results since the last time I looked).

I'll have to see if I can find some better references to explain this stuff
without requiring so much background.

Chris

David T. Wang

unread,
Mar 2, 1999, 3:00:00 AM3/2/99
to
Chris Cox (cc...@slip.net) wrote:
: (davexl) wrote:

: > In short, 99% of the tasks I see run in graphics are CPU bound
: > enough to scale with clock speed. I cannot tell whether they are only 1%
: > away from being bandwidth limited, but the limiting factor is the CPU.

: I really wish someone would write a good book on real-world computer
: architecture... Hennessy and Patterson (or Patterson and Hennessy) is just
: too theoretical to recommend -- and I can't find anything else that
: mentions the fact that the path between the CPU and RAM isn't just an
: infinite-bandwidth pipe. "Inner Loops" comes close, but doesn't really
: give great examples and dives immediately into assembly (where he probably
: shouldn't).

: Have you looked at the STREAM benchmarks?
: See http://reality.sgi.com/mccalpin/
: http://www.cs.virginia.edu/stream/ref.html#why gives some of the background
: (hmm, and some highly questionable results since the last time I looked).

I have looked at STREAM a few times, and even had a debate with Jason S.
on this topic a while back (which led to my cross-posting the
discussion to comp.benchmarks, where Dr. McCalpin himself responded).

STREAM was my basis for claiming that your simplified answer of "bandwidth
limitation" didn't make sense at the highest level. STREAM tries to
measure "sustainable memory bandwidth", but it differs significantly
from those operations in PS you've mentioned in that it doesn't have
the code jumping through 10+ regions of memory to grab data; it just
marches along, loads a big matrix, does something with it, and shoves it
back into memory, and I haven't seen a compiler which could optimize the
operations to nops. Based on STREAM, and the few things I have learned about
R/W buffers and memory ordering, I had long believed that the memory
subsystem is one of the strengths of Intel's processor line
(specifically the P6-based units) compared to PPC-based
implementations. Thus, it was strange for me to hear that PS is
bandwidth limited and that P6-based processors perform badly compared
with PPC-based processors.
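
For readers who haven't seen it, the four STREAM kernels really are
this simple; the loops below are paraphrased from the public source at
the URLs mentioned earlier, minus the timing and bandwidth accounting:

#include <stdio.h>

#define N 2000000                       /* much larger than the caches */
static double a[N], b[N], c[N];

static void stream_kernels(double q)
{
    for (long j = 0; j < N; j++) c[j] = a[j];                /* Copy  */
    for (long j = 0; j < N; j++) b[j] = q * c[j];            /* Scale */
    for (long j = 0; j < N; j++) c[j] = a[j] + b[j];         /* Add   */
    for (long j = 0; j < N; j++) a[j] = b[j] + q * c[j];     /* Triad */
}

int main(void)
{
    for (long j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; }
    stream_kernels(3.0);
    printf("%f\n", a[N - 1]);     /* keep the results observable */
    return 0;
}

Every kernel walks straight through two or three big arrays: perfectly
predictable, no reuse, and no set conflicts.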

However, based on the clues you have provided, it's my guess that
the entire issue hinges on L1 thrashing. STREAM doesn't thrash L1,
as it marches through the memory, so it's entirely reasonable that
Intel's processor based systems are obtaining higher STREAM numbers,
and yet suffer badly on your code.

You had mentioned that you do bus traces. I think that my guess
would be easy to prove/disprove with one of your bus traces.
At some intermediate "stream count", the P II-based system would
generate a lot more memory fetch requests, and some of those
requests are redundant requests for lines previously fetched.

Also, I know you said that you are working on a "new processor",
so you're probably busy. However, when and if you get a chance,
could you work an AMD K6-based system in there? The Super7-based
systems have a worse total bandwidth number (STREAM), but they have
a larger L1 cache, and it would go a long way toward validating or
invalidating my guess here.
Later.

: I'll have to see if I can find some better references to explain this stuff
: without requiring so much background.

: Chris

--

Chris Cox

unread,
Mar 3, 1999, 3:00:00 AM3/3/99
to
In article <7bg6m2$3lb$3...@hecate.umd.edu>,

dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
Wang) wrote:

> I have looked at STREAM a few times, and even had a debate with Jason S.
> on this topic a while back (which led to my cross-posting the
> discussion to comp.benchmarks, where Dr. McCalpin himself responded).
>
> STREAM was my basis for claiming that your simplified answer of "bandwidth
> limitation" didn't make sense at the highest level. STREAM tries to
> measure "sustainable memory bandwidth", but it differs significantly
> from those operations in PS you've mentioned in that it doesn't have
> the code jumping through 10+ regions of memory to grab data; it just
> marches along, loads a big matrix, does something with it, and shoves it
> back into memory, and I haven't seen a compiler which could optimize the
> operations to nops.

I used an experimental vector optimizer on STREAM, and it was reduced to
just timing calls and printf's. I was just trying to optimize it, I didn't
realize at first that it was that badly designed. I suspect that the Intel
compiler might optimize away sections of it as well (with the appropriate
switches).
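
That is the classic benchmark failure mode: if nothing observable
depends on the computed values, an aggressive optimizer may legally
delete the loops and leave exactly the "timing calls and printf's".
A hedged sketch of the problem and the usual defence:

#include <stdio.h>

#define N 1000000
static double a[N], c[N];

int main(void)
{
    /* A compiler that proves c[] is never read afterwards may remove
       this loop entirely.                                            */
    for (long j = 0; j < N; j++)
        c[j] = 3.0 * a[j];

    /* Defence: fold the output into something observable.            */
    double checksum = 0.0;
    for (long j = 0; j < N; j++)
        checksum += c[j];
    printf("checksum = %g\n", checksum);
    return 0;
}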


BTW - I did go back and look at the MP code for Photoshop 5.
Some of the color conversions are not threaded in 5 -- some of the core
code was changed very late in the cycle and the threading wasn't completed
until too late to make the ship date. Then someone vetoed updating it in
the dot release.
I knew I remembered writing the code, but then politics killed it, and in
the year-plus since I wrote it I forgot about that little detail.
Too much time, too little going on. Strike that, reverse it.


> Based on STREAM, and the few things I have learned about
> R/W buffers and memory ordering, I had long believed that the memory
> subsystem is one of the strengths of Intel's processor line
> (specifically the P6-based units) compared to PPC-based
> implementations. Thus, it was strange for me to hear that PS is
> bandwidth limited and that P6-based processors perform badly compared
> with PPC-based processors.
>
> However, based on the clues you have provided, it's my guess that
> the entire issue hinges on L1 thrashing.

It's not entirely L1; some of it is TLBs, L2, the DRAM controller, and DRAM
pages. I already get different results by changing the SIMMs in the system
(usually only a few percent different, but statistically valid
differences).
I'm still trying to figure out a way to show (and measure) problems in each
sub-system with some generic code. It ain't easy.


> STREAM doesn't thrash L1,
> as it marches through the memory, so it's entirely reasonable that
> Intel's processor based systems are obtaining higher STREAM numbers,
> and yet suffer badly on your code.
>
> You had mentioned that you do bus traces. I think that my guess
> would be easy to prove/disprove with one of your bus traces.

Do you know how long those take to analyze?
Hmm - I'll have to get together with some of the Intel motherboard guys and
see if they want to try this. Or see if I can get an up-to-date board
simulation program and lots of free time for it to run :-).


> At some intermediate "stream count", the P II-based system would
> generate a lot more memory fetch requests, and some of those
> requests are redundant requests for lines previously fetched.

Which should hit in the L2, most of the time.
But if _that_ thrashes....


> Also, I know you said that you are working on a "new processor",
> so you're probably busy. However, when and if you get a chance,
> could you work an AMD K6-based system in there? The Super7-based
> systems have a worse total bandwidth number (STREAM), but they have
> a larger L1 cache, and it would go a long way toward validating or
> invalidating my guess here.

Already working on it (and a few other surprises).


Chris

David T. Wang

unread,
Mar 4, 1999, 3:00:00 AM3/4/99
to
Chris Cox (cc...@slip.net) wrote:
: dave...@cslab.kecl.ntt.co.jp.DELETE.delete.DELETE@Glue.umd.edu (David T.
: Wang) wrote:

: > I have looked at STREAM a few times, and even had a debate with Jason S.
: > on this topic a while back (which led to my cross-posting the
: > discussion to comp.benchmarks, where Dr. McCalpin himself responded).
: >
: > STREAM was my basis for claiming that your simplified answer of "bandwidth
: > limitation" didn't make sense at the highest level. STREAM tries to
: > measure "sustainable memory bandwidth", but it differs significantly
: > from those operations in PS you've mentioned in that it doesn't have
: > the code jumping through 10+ regions of memory to grab data; it just
: > marches along, loads a big matrix, does something with it, and shoves it
: > back into memory, and I haven't seen a compiler which could optimize the
: > operations to nops.

: I used an experimental vector optimizer on STREAM, and it was reduced to
: just timing calls and printf's. I was just trying to optimize it, I didn't
: realize at first that it was that badly designed. I suspect that the Intel
: compiler might optimize away sections of it as well (with the appropriate
: switches).

It might, but not with plain old gcc -O3 or -O4, that's for sure. Just
recently I tinkered with it a bit and played with the optimization
switches, as was suggested, and the difference between the optimization
levels was a few percent.

-------------------------------------------------------------
Function Rate (MB/s) RMS time Min time Max time
Copy: 290.9753 0.2202 0.2199 0.2208
Scale: 288.8373 0.2217 0.2216 0.2218
Add: 344.3353 0.2788 0.2788 0.2790
Triad: 304.6614 0.3154 0.3151 0.3158

Function Rate (MB/s) RMS time Min time Max time
Copy: 305.0379 0.2102 0.2098 0.2112
Scale: 305.3537 0.2097 0.2096 0.2100
Add: 367.9768 0.2609 0.2609 0.2610
Triad: 309.1708 0.3107 0.3105 0.3109
------------------------------------------------------------

-O0 for the top, and -O4 for the bottom.

STREAM tries to measure a kind of "raw bandwidth" and marches
across memory in an incredibly predictable fashion. It's very
simple, and as Dr. McCalpin himself explained, the computational
latency is entirely irrelevant, unless you're running it on a
386 with a separate 387 FPU or something like that. Then, of
course, the computational latency matters.

: BTW - I did go back and look at the MP code for Photoshop 5.
: Some of the color conversions are not threaded in 5 -- some of the core
: code was changed very late in the cycle and the threading wasn't completed
: until too late to make the ship date. Then someone vetoed updating it in
: the dot release.
: I knew I remembered writing the code, but then politics killed it, and in
: the year-plus since I wrote it I forgot about that little detail.
: Too much time, too little going on. Strike that, reverse it.

: > Based on STREAM, and the few things I have learned about
: > R/W buffers and memory ordering, I had long believed that the memory
: > subsystem is one of the strengths of Intel's processor line
: > (specifically the P6-based units) compared to PPC-based
: > implementations. Thus, it was strange for me to hear that PS is
: > bandwidth limited and that P6-based processors perform badly compared
: > with PPC-based processors.
: >
: > However, based on the clues you have provided, it's my guess that
: > the entire issue hinges on L1 thrashing.

: It's not entirely L1, some of it is TLBs, L2, the DRAM controller and DRAM
: pages. I already get different results by changing the SIMMs in the system
: (usually only a few percent different, but statistically valid
: differences).

Yes, but as you might have noticed, I keep ignoring the issues with
TLBs, DRAM controllers, and DRAM pages. It's not that I am wholly
ignorant of their effects; it's just that at worst, you can perhaps
add a single cycle or two of latency for a TLB lookup (unless you're
whacking TLBs too), and the memory subsystem at best would give you
a differential of a single bus cycle or two with regard to the same
open pages; if the page is not open, you set up the RAS, then the CAS,
and you have to wait some 11 or 13 cycles... However, that is
precisely the kind of thing STREAM tries to hit. And we're basically
all using PC100 SDRAM, so the variance would be, as you are seeing,
a few percent one way or the other. What I wanted to examine was
why we see a big performance delta ( >> 100%) on some of the same
functions between the x86 and the PPC platform, and I quickly
ruled out the memory subsystem outside of the CPU. The subsystems look
similar enough between a Mac and an x86 box that such a large
delta can't be attributed to them. It doesn't look like the ISA
has much to do with it either, as you say it is entirely bandwidth
limited. That points me toward the implementation
details inside the micro, and so far, the "L1 thrashing" theory
just sounds more plausible to me.
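
David's cycle figures make the "few percent" intuition easy to check
with a blended-cost calculation (the hit and miss costs below are his
rough numbers; the page-hit rates are assumed):

#include <stdio.h>

int main(void)
{
    double t_hit = 2.0, t_miss = 12.0;     /* bus cycles, rough       */
    for (int pct = 95; pct >= 80; pct -= 5) {
        double hit = pct / 100.0;
        printf("page-hit rate %d%%: average %.2f cycles per access\n",
               pct, hit * t_hit + (1.0 - hit) * t_miss);
    }
    return 0;
}

Small shifts in hit rate move the average by fractions of a cycle, and
since memory cycles are only part of total run time, the end-to-end
effect is the few-percent delta being described.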

: I'm still trying to figure out a way to show (and measure) problems in each
: sub-system with some generic code. It ain't easy.

Acknowledged. If it were easy, they wouldn't need computer engineers to
do such work. :)

: > STREAM doesn't thrash L1,
: > as it marches through the memory, so it's entirely reasonable that
: > Intel's processor based systems are obtaining higher STREAM numbers,
: > and yet suffer badly on your code.
: >
: > You had mentioned that you do bus traces. I think that my guess
: > would be easy to prove/disprove with one of your bus traces.

: Do you know how long those take to analyze?

"easy" is a relative term obviously. :) Yes, staring at 100MB+
worth of bus traces probably isn't a lot of fun.

: Hmm - I'll have to get together with some of the Intel motherboard guys and
: see if they want to try this. Or see if I can get an up-to-date board
: simulation program and lots of free time for it to run :-).

I would suggest looking deeper. Into the processor itself, if you're
not already doing it.

: > At some intermediate "stream count", the P II-based system would
: > generate a lot more memory fetch requests, and some of those
: > requests are redundant requests for lines previously fetched.

: Which should hit in the L2, most of the time.
: But if _that_ thrashes....

The L2 has the same associativity as the L1 on the PII/PIII class of
processors; it has less associativity than the L1 on the PPC 750.
If you get unlucky and hit the exact same address alignment, it'll
thrash just as fast as the L1 thrashed.

If you're doing truly random access, you'll only get "unlucky"
a few times, but from your description of the manner in which
you are marching across memory, if you're "unlucky" you can get
stuck in "unlucky mode", forever thrashing L1, and L2 won't catch
a thing either.
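
The "unlucky mode" arithmetic, for an assumed 512 KB, four-way,
32-byte-line L2: any two addresses that differ by a multiple of
cache_size / ways index the same set, so five lock-step streams
allocated 1 MB apart fight over four ways forever:

#include <stdio.h>

int main(void)
{
    unsigned long cache = 512 * 1024;      /* 512 KB L2 (assumed)    */
    unsigned long ways  = 4;
    unsigned long line  = 32;
    unsigned long sets  = cache / ways / line;     /* 4096 sets      */

    for (unsigned k = 0; k < 5; k++) {
        unsigned long addr = k * 1024UL * 1024UL;  /* 1 MB apart     */
        printf("stream %u -> set %lu\n", k, (addr / line) % sets);
    }
    return 0;
}

All five streams map to set 0; with only four ways, LRU evicts a
resident line on every iteration, so neither the L1 nor the L2 ever
catches a re-reference.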

: > Also, I know you said that you are working on a "new processor",
: > so you're probably busy. However, when and if you get a chance,
: > could you work an AMD K6-based system in there? The Super7-based
: > systems have a worse total bandwidth number (STREAM), but they have
: > a larger L1 cache, and it would go a long way toward validating or
: > invalidating my guess here.

: Already working on it (and a few other surprises).

: Chris

--

Chris Cox

unread,
Mar 4, 1999, 3:00:00 AM3/4/99
to

TLB misses can be deadly -- and much more than a cycle or 2.
The DRAM controller can make DRAM page misses even worse (in some cases a
LOT worse).
And I'm still working on the test programs to show some of this...
(Free time? What's free time?)


> : Hmm - I'll have to get together with some of the Intel motherboard guys and
> : see if they want to try this. Or see if I can get an up-to-date board
> : simulation program and lots of free time for it to run :-).
>
> I would suggest looking deeper. Into the processor itself, if you're
> not already doing it.

That's why I'm thinking about simulators.
The good ones simulate the processor AND the memory system (sans interrupts
and video, of course).


Chris

David T. Wang

unread,
Mar 5, 1999, 3:00:00 AM3/5/99
to
Chris Cox (cc...@slip.net) wrote:

: > Yes, but as you might have noticed, I keep ignoring the issues with
: > TLBs, DRAM controllers, and DRAM pages. It's not that I am wholly
: > ignorant of their effects; it's just that at worst, you can perhaps
: > add a single cycle or two of latency for a TLB lookup (unless you're
: > whacking TLBs too), and the memory subsystem at best would give you
: > a differential of a single bus cycle or two with regard to the same
: > open pages; if the page is not open, you set up the RAS, then the CAS,
: > and you have to wait some 11 or 13 cycles... However, that is
: > precisely the kind of thing STREAM tries to hit. And we're basically
: > all using PC100 SDRAM, so the variance would be, as you are seeing,
: > a few percent one way or the other. What I wanted to examine was
: > why we see a big performance delta ( >> 100%) on some of the same
: > functions between the x86 and the PPC platform, and I quickly
: > ruled out the memory subsystem outside of the CPU. The subsystems look
: > similar enough between a Mac and an x86 box that such a large
: > delta can't be attributed to them. It doesn't look like the ISA
: > has much to do with it either, as you say it is entirely bandwidth
: > limited. That points me toward the implementation
: > details inside the micro, and so far, the "L1 thrashing" theory
: > just sounds more plausible to me.

: TLB misses can be deadly -- and much more than a cycle or 2.
: The DRAM controller can make DRAM page misses even worse (in some cases a
: LOT worse).

Yes, but those events still can't account for the large performance
disparity that we see between a "PC" and a PPC-based Mac on some of
these specific functions. (I might concede on some function which
entirely thrashes the TLB on a P6-based processor but lives well
on a PPC-based one, though I still see that as unlikely. There is
a thread in comp.arch about the cost of chasing pointers, and
at some point the TLB just gets whacked, but the P6-based CPUs
seem to hold up very well against PPC-based systems, so this
possibility seems slim as well.) And I still very much discount the
DRAM subsystem problem; I can't see the DRAM controller being a
significant contributor to the performance delta between a "PC" and a
PPC-based Mac. A few percent here or there, sure, but not the
large delta we end up discussing. That same DRAM subsystem is
good enough to generate good STREAM numbers, as well as good numbers
to support SPECFP....

: And I'm still working on the test programs to show some of this...
: (Free time? What's free time?)

: > : Hmm - I'll have to get together with some of the Intel motherboard guys and
: > : see if they want to try this. Or see if I can get an up-to-date board
: > : simulation program and lots of free time for it to run :-).
: >
: > I would suggest looking deeper. Into the processor itself, if you're
: > not already doing it.

: That's why I'm thinking about simulators.
: The good ones simulate the processor AND the memory system (sans interrupts
: and video, of course).

...

Pete Cooper

unread,
Mar 8, 1999, 3:00:00 AM3/8/99
to
>Don't tell me you have personally seen that the world is round.

Hmm. It's kinda obvious when you stand at the shore and you see the masts of
boats dipping below the horizon. It's also fairly obvious when you consider
the sky is like a huge dome around us..

Then again, it wasn't obvious to the people 500 years ago ;o)

Regards,
Pete

