Trying to optimize PHP on a T5140

hume.sp...@bofh.ca

Feb 4, 2010, 3:05:04 PM

I've got some developers who have obtained a "php benchmarking" script
from http://www.free-webhosts.com/php-benchmark-script.php .
They started using this while investigating perceived slowness in their
Horde Webmail application.

What this script does is essentially summed up as:

for ($i = 1; $i <= 20000; $i++) {
    $x = $i * 5;
    $x = $x + $x;
    $x = $x / 10;
    $string3 = $string1 . strrev($string1);
    $string2 = substr($string1, 9, 1) . substr($string1, 0, 9);
    $string1 = $string2;
}

So the entire test is essentially a boatload of arithmetic and string
manipulation. But I can't really dismiss it as an unfair test, since a
webmail client is essentially a pile of string manipulation.

Now the T5140s, when running this script, average around 448ms for the 20k
iterations when using both Webstack PHP and local-compiled PHP. If I
recompile PHP with Sun Studio 12u1 with "-fast" I can get that down to an
average of 390 ms. Oddly enough -xipo doesn't do as well, with an average
of 404 ms.

In contrast, a Xeon box running Linux (2.2GHz) averages 40 ms. Yes, the
x86 runs at twice the clock speed; but it delivers ten times the performance
(both machines unloaded).
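
As a quick check on the figures quoted above, the ratios can be worked out as follows. This is only a sketch of the arithmetic in Python; the dictionary keys and the helper name `summarize` are illustrative, and the numbers are the averages given in this post.

```python
# Average times in milliseconds for 20,000 iterations, as quoted in the post.
timings_ms = {
    "T5140, Webstack or locally compiled PHP": 448,
    "T5140, Sun Studio 12u1 with -fast": 390,
    "T5140, Sun Studio 12u1 with -xipo": 404,
    "Xeon 2.2GHz, Linux": 40,
}

baseline_ms = timings_ms["T5140, Webstack or locally compiled PHP"]
xeon_ms = timings_ms["Xeon 2.2GHz, Linux"]


def summarize(timings, baseline, reference):
    """Return (name, ms, % less time than baseline, multiple of reference time)."""
    rows = []
    for name, ms in timings.items():
        saved_pct = (baseline - ms) / baseline * 100
        vs_reference = ms / reference
        rows.append((name, ms, round(saved_pct, 1), round(vs_reference, 2)))
    return rows


for name, ms, saved_pct, vs_xeon in summarize(timings_ms, baseline_ms, xeon_ms):
    print(f"{name}: {ms} ms, {saved_pct}% less time than the stock build, "
          f"{vs_xeon}x the Xeon time")
```

On these numbers the -fast build takes about 13% less time than the stock build, and still takes a little under ten times as long as the Xeon.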

Anyone have any suggestions I could look at to make the Suns more competitive?
Or at least explain definitively why the Suns do worse? (I've already
pointed out that the Coolthreads machines are built for power efficiency
and parallelism...)

--
Brandon Hume - hume -> BOFH.Ca, http://WWW.BOFH.Ca/

Andrew Gabriel

Feb 4, 2010, 4:10:02 PM

In article <hkf99g$dl9$1...@kil-nws-1.ucis.dal.ca>,

hume.sp...@bofh.ca writes:
> I've got some developers who have obtained a "php benchmarking" script
> from http://www.free-webhosts.com/php-benchmark-script.php .
> They started using this while investigating perceived slowness in their
> Horde Webmail application.
>
> What this script does is essentially summed up as:
>
> for($i = 1; $i <= 20000; $i++) {
> $x=$i * 5;
> $x=$x + $x;
> $x=$x/10;
> $string3 = $string1 . strrev($string1);
> $string2 = substr($string1, 9, 1) . substr($string1, 0, 9);
> $string1 = $string2;
> }
>
> So the entire test is essentially a boatload of arithmetic and string
> manipulation. But I can't really dismiss it as an unfair test, since a
> webmail client is essentially a pile of string manipulation.
>
> Now the T5140s, when running this script, average around 448ms for the 20k
> iterations when using both Webstack PHP and local-compiled PHP. If I
> recompile PHP with Sun Studio 12u1 with "-fast" I can get that down to an
> average of 390 ms. Oddly enough -xipo doesn't do as well, with an average
> of 404 ms.
>
> In contrast, a Xeon box running Linux (2.2GHz) averages 40 ms. Yes, the
> x86 runs at twice the clock speed; but it delivers ten times the performance
> (both machines unloaded).

You're only using somewhere between 1% and 6% of the T5140.
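
One plausible reading of those two figures (an assumption, since the post does not spell it out) is that a single-threaded process occupies one hardware thread out of 128, or at most one core out of 16. The 128 comes from the "other 127 vcpus" mentioned later in the thread, and the UltraSPARC T2 Plus runs 8 hardware threads per core. A short Python sketch of that arithmetic, with illustrative names:

```python
vcpus = 128            # 1 busy + 127 idle, per the thread
threads_per_core = 8   # UltraSPARC T2 Plus hardware threads per core
cores = vcpus // threads_per_core


def share_percent(units):
    """Percentage of the machine represented by one unit out of `units`."""
    return 100 / units


print(f"one of {vcpus} hardware threads: {share_percent(vcpus):.2f}%")
print(f"one of {cores} cores:            {share_percent(cores):.2f}%")
```

That gives a little under 1% if the process is counted as one hardware thread, and about 6% if it is counted as tying up a whole core.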

> Anyone have any suggestions I could look at to make the Suns more competitive?
> Or at least explain definitively why the Suns do worse? (I've already
> pointed out that the Coolthreads machines are built for power efficiency
> and parallelism...)

How many mail users are there going to be? Only one?

Run 256 of them in parallel, and compare the total times.

If your mail sessions are encrypted, add in a couple of hundred
crypto sessions on top (all done in hardware in the T5140 if you
use the crypto framework correctly).

The Xeon box should be left miles behind.
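
A rough way to try this comparison is to launch N copies of the benchmark at once and time the whole batch. The sketch below is in Python and makes several assumptions: `run_batch` is a hypothetical helper, the stand-in worker is a Python transliteration of the loop from the original post with a made-up 10-character starting string, and interpreter start-up time is included in the measurement. On the real machines one would pass the actual PHP command line instead of the stand-in.

```python
import subprocess
import sys
import time

# Stand-in worker: a Python transliteration of the PHP loop quoted above.
# The starting string is an assumption; the original script's value is not shown.
WORKER_SOURCE = """
s1 = "abcdefghij"
for i in range(1, 20001):
    x = i * 5
    x = x + x
    x = x / 10
    s3 = s1 + s1[::-1]
    s2 = s1[9:10] + s1[0:9]
    s1 = s2
print(s1)
"""

STAND_IN_COMMAND = [sys.executable, "-c", WORKER_SOURCE]


def run_batch(command, copies):
    """Start `copies` processes at once; return (wall seconds, list of outputs)."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen(command, stdout=subprocess.PIPE, text=True)
        for _ in range(copies)
    ]
    outputs = [proc.communicate()[0].strip() for proc in procs]
    elapsed = time.perf_counter() - start
    return elapsed, outputs


if __name__ == "__main__":
    for copies in (1, 4, 16):
        elapsed, _ = run_batch(STAND_IN_COMMAND, copies)
        print(f"{copies:3d} copies: {elapsed:.3f} s total, "
              f"{copies / elapsed:.1f} runs/s")
```

The figure to compare between the two machines is runs per second as the copy count grows toward the number of hardware threads (the post suggests 256 copies), not the time for a single copy.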

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]

hume.sp...@bofh.ca

Feb 5, 2010, 12:39:01 PM

Andrew Gabriel <and...@cucumber.demon.co.uk> wrote:
>> In contrast, a Xeon box running Linux (2.2GHz) averages 40 ms. Yes, the
>> x86 runs at twice the clock speed; but it delivers ten times the performance
>> (both machines unloaded).
>
> You're only using somewhere between 1% and 6% of the T5140.

I realize this. I KNOW the 5140 will blow the Xeon out of the water under
massive load. But... for that one, single-threaded process... running at
half the clock rate you'd expect the process to take twice as long, while
the other 127 vcpus twiddled their thumbs because they couldn't help out.

The question I'm being asked by the developers is: if the Sun runs at half
the clock rate, 40 ms becomes 80 ms, being generous and round it up to 100 ms.
Where is the other 290 ms going? Is it being lost to context switching? Is
the nature of the way PHP does substring calls hostile to the cache? (I've
run into that problem before, though not with PHP...) Something else?
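
Spelled out with the figures from this thread, the developers' back-of-the-envelope reasoning looks like this. It is a sketch in Python with illustrative variable names; the naive assumption being tested is that run time scales inversely with clock rate.

```python
xeon_ms = 40            # Xeon average from the first post
clock_ratio = 2         # Xeon clock is about twice the T5140's, per the thread
best_t5140_ms = 390     # T5140 average with the -fast build

expected_ms = xeon_ms * clock_ratio              # naive clock-scaled estimate
generous_ms = 100                                # rounded up, as in the post
unexplained_ms = best_t5140_ms - generous_ms
slowdown_per_clock = best_t5140_ms / expected_ms

print(f"clock-scaled estimate: {expected_ms} ms")
print(f"unexplained time:      {unexplained_ms} ms")
print(f"slower per clock by:   {slowdown_per_clock:.2f}x")
```

In other words, even after allowing for clock speed, the T5140 needs nearly five times as long per run, which is the gap the question is about.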

I managed to squeeze another 14% performance out of PHP by recompiling PHP
with SS12u1 and enabling the -fast CFLAGS.

> How many mail users are there going to be? Only one?

No, thousands. And I fully expect the Suns to shine in that environment.
But we all know that the END user doesn't care that you're supporting a
thousand more users than any x86 could... as far as they're concerned, they
want one server per user if it means you make the page refresh twice as fast.

(I actually commented to the devs that we could replace the two 5140s with
32 Xeon boxes... I don't think the tone of my comment went over well... :) )

> If your mail sessions are encrypted, add in a couple of hundred
> crypto sessions on top (all done in hardware in the T5140 if you
> use the crypto framework correctly).

I wish. The primary application on these machines isn't Horde, but a vendor
black box being set up exactly to vendor recommendations... which includes
a Cisco load balancer handling all the SSL.

Drazen Kacar

Feb 5, 2010, 1:36:02 PM

hume.sp...@bofh.ca wrote:

> The question I'm being asked by the developers is: if the Sun runs at
> half the clock rate, 40 ms becomes 80 ms, being generous and round it
> up to 100 ms. Where is the other 290 ms going? Is it being lost to
> context switching? Is the nature of the way PHP does substring calls
> hostile to the cache? (I've run into that problem before, though not
> with PHP...) Something else?

Perhaps. Perhaps the malloc implementation is trying to conserve memory,
and so is taking too much time. Or something else entirely.

You could try profiling the application. The easiest way is to use
LD_PROFILE (man ld.so.1).

> I managed to squeeze another 14% performance out of PHP by recompiling PHP
> with SS12u1 and enabling the -fast CFLAGS.

You could try with -xprofile. That's cheating, but I wonder if that would
help.

--
.-. .-. Yes, I am an agent of Satan, but my duties are largely
(_ \ / _) ceremonial.
|
| da...@fly.srk.fer.hr

David Kirkby

Feb 6, 2010, 3:14:51 AM

On 4 Feb, 20:05, hume.spamfil...@bofh.ca wrote:
> I've got some developers who have obtained a "php benchmarking" script
> from http://www.free-webhosts.com/php-benchmark-script.php .
> They started using this while investigating perceived slowness in their
> Horde Webmail application.

<SNIP>

> Anyone have any suggestions I could look at to make the Suns more competitive?
> Or at least explain definitively why the Suns do worse? (I've already
> pointed out that the Coolthreads machines are built for power efficiency
> and parallelism...)
>
> --
> Brandon Hume - hume -> BOFH.Ca, http://WWW.BOFH.Ca/

As has been pointed out, the benchmark you give is single-threaded, so
the T5140 will not run it well. There is nothing more to say about that.

I just Googled the T5140 and found a broken link, but the closely
related T5240 has a valid link.

I use a T5240 myself. I think the blurb on the T5240 could be more
accurate and honest. To quote from that page:

http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/cmt-servers/031584.htm

"Watch the T5240 server blaze through anything you throw its way,"

Obviously the page describes the parallelism, but nowhere does it say
that single-threaded performance will be poor, or that the machine is
therefore not suited to all tasks.

I'm pretty sure that, had the poor single-threaded performance been
noted, the University of Washington would not have accepted the
donation of a T5240 from Sun. A couple of sentences such as the
following would help:

"It should be noted the T5240 is not suitable for all applications,
and in particular it will not perform well for a small number of
single-threaded tasks"

It's true that its key applications are listed, but I think it would
be sensible to point out the limitations too.

Sun have a habit of doing this. The spec sheet on the Ultra 27 states
it can take 12 GB of RAM and the RAM runs at 1333 MHz. Nowhere
does it say that the speed drops if the memory is over 6 GB. I only
found that buried deep in the service manual.

I've got 12 GB in my Ultra 27, and am very pleased with its
performance. The 3.33 GHz processor is faster than any other machine
I've used, except for machines with many more cores running
multi-threaded code.

In the case of the Ultra 27, the memory issue is a minor one. In the
case of the T5240, I believe a few sentences describing tasks it is
not suitable for would be a good idea. If nothing else, it would be
something to point someone at every time there are reports of poor
performance from these machines.

Dave

ChrisS

Feb 6, 2010, 11:35:07 AM

On Feb 6, 3:14 am, David Kirkby <drkir...@gmail.com> wrote:
> On 4 Feb, 20:05, hume.spamfil...@bofh.ca wrote:
>
> > I've got some developers who have obtained a "php benchmarking" script
> > from http://www.free-webhosts.com/php-benchmark-script.php .
> > They started using this while investigating perceived slowness in their
> > Horde Webmail application.
>
> <SNIP>
>
> > Anyone have any suggestions I could look at to make the Suns more competitive?
> > Or at least explain definitively why the Suns do worse? (I've already
> > pointed out that the Coolthreads machines are built for power efficiency
> > and parallelism...)
>
> > --
> > Brandon Hume - hume -> BOFH.Ca, http://WWW.BOFH.Ca/
>
> As has been pointed out, the benchmark you give is single-threaded, so
> the T5140 will not run it well. There is nothing more to say about that.
>
> I just Googled the T5140 and found a broken link, but the closely
> related T5240 has a valid link.
>
> I use a T5240 myself. I think the blurb on the T5240 could be more
> accurate and honest. To quote from that page:
>

> http://www.oracle.com/us/products/servers-storage/servers/sparc-enter...

Not to start a fight between admins and developers, but after admins
have thrown more horsepower at a web application it's time to get the
developers to take an earnest second look at their own code. I've had
our web developers do that after I've exhausted server-side solutions.
The developers, more often than not, find a better way of writing their
code, speeding up their apps 2- or 3-fold. In a few instances it
was simply a matter of changing the logical order of processing. I
love when they admit defeat. :-) Having a truly open dialog between
admin & devs is priceless.

Good luck

Andrew Gabriel

Feb 6, 2010, 8:03:52 PM

In article <hkhl3l$503$1...@kil-nws-1.ucis.dal.ca>,

hume.sp...@bofh.ca writes:
> Andrew Gabriel <and...@cucumber.demon.co.uk> wrote:
>>> In contrast, a Xeon box running Linux (2.2GHz) averages 40 ms. Yes, the
>>> x86 runs at twice the clock speed; but it delivers ten times the performance
>>> (both machines unloaded).
>>
>> You're only using somewhere between 1% and 6% of the T5140.
>
> I realize this. I KNOW the 5140 will blow the Xeon out of the water under
> massive load. But... for that one, single-threaded process... running at
> half the clock rate you'd expect the process to take twice as long, while
> the other 127 vcpus twiddled their thumbs because they couldn't help out.
>
> The question I'm being asked by the developers is: if the Sun runs at half
> the clock rate, 40 ms becomes 80 ms, being generous and round it up to 100 ms.

Not as simple as that. If you look at a Xeon, or Ultrasparc, or Sparc64,
these have long pipelines and process several instructions in parallel.
This enables them to look ahead and predict what memory accesses they'll
need and fire off the requests in advance so they don't waste as much
time later with a pipeline stall. The logic supporting this pipeline is
much bigger than the logic performing the conventional CPU functions.

The T series processors don't have this. Instead, they are designed to
handle pipeline stalls simply by doing a very fast context switch to
another thread, and leaving the stalled thread to do its memory access
whilst another thread is running. This works very well when you have
lots of runnable threads - you find the stalled time when a T series
core can't do anything is typically much less than that of a long
pipeline core, which is why its performance flies, and it doesn't have
all that extra heat-generating pipeline logic. However, if you only
have one thread, that's going to get loads more pipeline stalls than
it would on a long pipeline processor, so even at the same clock speed,
it will be significantly slower.
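
The effect described above can be illustrated with a toy model. The Python sketch below is not a model of the real T-series pipeline; it assumes each thread alternates a fixed burst of compute cycles with a fixed memory stall, and that the core switches to the next ready thread at no cost whenever the running thread stalls. The function name and the 2-cycle and 6-cycle figures are made up for illustration.

```python
def core_utilization(threads, compute_cycles, stall_cycles, total_cycles=8000):
    """Fraction of cycles in which the core had a ready thread to run."""
    ready_at = [0] * threads                  # cycle when each thread can run again
    left_in_burst = [compute_cycles] * threads
    current = 0                               # thread the core tries first
    busy = 0
    for cycle in range(total_cycles):
        for offset in range(threads):
            t = (current + offset) % threads
            if ready_at[t] <= cycle:
                busy += 1
                left_in_burst[t] -= 1
                if left_in_burst[t] == 0:
                    # Burst finished: this thread now waits on memory,
                    # and the core moves on to the next thread.
                    ready_at[t] = cycle + 1 + stall_cycles
                    left_in_burst[t] = compute_cycles
                    current = (t + 1) % threads
                else:
                    current = t
                break
    return busy / total_cycles


for n in (1, 2, 4, 8):
    print(n, "thread(s):", core_utilization(n, compute_cycles=2, stall_cycles=6))
```

With these made-up numbers a lone thread keeps the core busy only a quarter of the time, while four or more threads keep it fully occupied, which is the behaviour the post describes.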

> Where is the other 290 ms going? Is it being lost to context switching? Is

There's no context switching when you have only one thread. It's lost
in pipeline stalls because the logic to avoid them isn't there.

> the nature of the way PHP does substring calls hostile to the cache? (I've
> run into that problem before, though not with PHP...) Something else?

There's something else which might add to this. If the flow of logic
through the compiled PHP binary keeps calling and returning through lots
of deeply stacked functions, it will be generating lots of spill/fill
register window traps. Sparc is very fast at function calls because of the
way it keeps multiple register sets in the CPU, but when you exceed the
CPU's capability to store them, it has to spill them out to memory, and
conversely fill them back up again as you return through the large number
of stack frames.
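
A simplified way to picture the spill/fill behaviour is to count traps for a given call pattern. The Python sketch below is a toy model, not a description of any particular SPARC implementation: the figure of 7 usable windows is a hypothetical value chosen for illustration, the function names are invented, and real trap handlers and window counts depend on the CPU and the operating system.

```python
def count_window_traps(trace, usable_windows):
    """Count (spills, fills) for a sequence of "call" and "ret" steps."""
    depth = 1       # frames on the call stack (execution starts inside main)
    resident = 1    # innermost frames currently held in register windows
    spills = fills = 0
    for step in trace:
        if step == "call":
            depth += 1
            if resident == usable_windows:
                spills += 1          # oldest resident window is written to memory
            else:
                resident += 1
        else:  # "ret"
            depth -= 1
            resident -= 1
            if resident == 0 and depth > 0:
                fills += 1           # caller's window is read back from memory
                resident = 1
    return spills, fills


def oscillate(extra_depth, rounds):
    """Call `extra_depth` levels down, return all the way back, repeat."""
    return (["call"] * extra_depth + ["ret"] * extra_depth) * rounds


for extra_depth in (4, 6, 12):
    spills, fills = count_window_traps(oscillate(extra_depth, rounds=1000),
                                       usable_windows=7)
    print(f"{extra_depth:2d} nested calls: {spills} spills, {fills} fills")
```

In this simplified model, call chains that stay within the available windows never trap, while deeper ones pay a spill on the way down and a fill on the way back up for every frame beyond that limit.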

> I managed to squeeze another 14% performance out of PHP by recompiling PHP
> with SS12u1 and enabling the -fast CFLAGS.

If you aren't already, see if -xO4 makes any difference; this should
perform function inlining and tail-call optimisation, both of which
will reduce the number of register windows used, if this is part of
the problem. (A longer read through the cc options might reveal some
other appropriate ones here - not something I know off the top of my
head.)

Andrew Gabriel

Feb 6, 2010, 8:07:52 PM

In article <a8b10051-7348-4558...@a1g2000vbl.googlegroups.com>,

ChrisS <chris....@gmail.com> writes:
> Not to start a fight between admins and developers, but after admins
> have thrown more horsepower at a web application it's time to get the
> developers to take an earnest second look at their own code. I've had
> our web developers do that after I've exhausted server-side solutions.
> The developers, more often than not, find a better way of writing their
> code, speeding up their apps 2- or 3-fold. In a few instances it
> was simply a matter of changing the logical order of processing. I
> love when they admit defeat. :-) Having a truly open dialog between
> admin & devs is priceless.

Something I've done in this circumstance many times is to run
analyzer(1) on the app, and then hand the histograms back to the
developers. It usually results in comments like "but we shouldn't
even be going into this code", whilst pointing at something which
is using 90% of the CPU, such as some debugging functions...

hume.sp...@bofh.ca

Feb 7, 2010, 6:44:02 AM

Andrew Gabriel <and...@cucumber.demon.co.uk> wrote:
> Not as simple as that. If you look at a Xeon, or Ultrasparc, or Sparc64,
> these have long pipelines and process several instructions in parallel.

This is exactly the kind of explanation I was looking for (and educational
for me, to boot). Thanks for taking the time to write it out.

> If you aren't already, see if -xO4 makes any difference; this should
> perform function inlining and tail-call optimisation, both of which

-fast is a macro that turns on -xO5... so that's taken care of. The next
step is using -xprofile to turn on profiling collect/use, but that increases
compile time by orders of magnitude and I'm not experienced enough in how to
use it properly. There's a guide on wiki.sun.com, even specialized for
profiling PHP, but the information there seems incomplete.