
8.4-1H1 Performance improvements on HP Integrity bl8x0c–i4 and rx2800-i4 server


Jan-Erik Soderholm

Sep 5, 2015, 5:06:25 AM

John Reagan

Sep 5, 2015, 9:55:52 AM

On Saturday, September 5, 2015 at 5:06:25 AM UTC-4, Jan-Erik Soderholm wrote:
> Just happened to come across this:
>
> http://www.downloads.openvmsmigration.com/2015/openvms-performance-on-i4.pdf

The point that Colin missed in the "i2 to i4 differences" is that Poulson is an "out-of-order" machine. In my opinion, that accounts for much of the performance improvement. In places where GEM didn't get perfect bundling or scheduling, the chip's out-of-order instruction processing finds more performance.

Dirk Munk

Sep 5, 2015, 10:17:04 AM

Excellent observation Jan-Erik. That was one of the major design faults
in the Itanium, and HP/Intel finally corrected it in Poulson.

Stephen Hoffman

Sep 5, 2015, 10:24:03 AM

On 2015-09-05 09:06:29 +0000, Jan-Erik Soderholm said:

> Just happened to come across this:
>
> http://www.downloads.openvmsmigration.com/2015/openvms-performance-on-i4.pdf


i4 bandwidth ~5 GBps to main memory, compare with AlphaServer GS1280
EV7 at ~12.8 GBps.

i4 latency is between ~100 and ~200 ns, compare EV7 average latency at
~260 ns, with a range of 83 ns to 390 ns for the furthest memory in the
torus.

Processor caches and the associated cache latency and bandwidth and
cache sizes play a huge factor here for aggregate application
performance, as main memory is much slower.
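
As a rough illustration of that point, here is a minimal sketch of the usual weighted-average access time estimate. The main-memory latencies are the approximate figures quoted above; the 5 ns cache hit cost and the hit rates are assumptions picked only for illustration, not measurements from either platform.

```python
def effective_latency_ns(hit_rate, cache_ns, memory_ns):
    """Average access time when a fraction hit_rate of accesses hit in cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


# Main-memory latencies: the rough figures quoted in this post, in ns.
MEMORY_NS = {
    "i4 (low end)": 100.0,
    "i4 (high end)": 200.0,
    "EV7 (average)": 260.0,
}

# ASSUMPTION: hypothetical cost of a cache hit, for illustration only.
CACHE_NS = 5.0

for name, mem_ns in MEMORY_NS.items():
    # ASSUMPTION: hypothetical hit rates, not taken from any benchmark.
    for hit_rate in (0.90, 0.99):
        avg = effective_latency_ns(hit_rate, CACHE_NS, mem_ns)
        print(f"{name:14s} hit rate {hit_rate:.0%}: ~{avg:.1f} ns average")
```

Under these assumed numbers, a small change in hit rate moves the average more than the difference in main-memory latency does, which is why the cache details matter so much for application performance.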

If/when the processors and servers arrive, the Kittson improvements
will be interesting. Links to some of the internal documents and
funding information are in the Wikipedia Itanium article
<https://en.wikipedia.org/wiki/Itanium>, for folks interested in a
refresher around those details.

For those that want a refresher on Poulson itself, see
<http://www.realworldtech.com/poulson/>

Did a quick look for Skylake references, and there's very little data
posted. For DDR4 via a recent top-end i7, looks like ~50 GBps, and
latencies ~25 ns.
<http://www.tomshardware.com/reviews/adata-xpg-z1-crucial-ddr4-x99,4007-5.html>
(This data is single-socket, which avoids the overhead of
coordinating across multiple sockets. Poulson was released ~3 years
ago, so direct comparisons with a more recent Broadwell and
just-released Skylake design are unfair.)

While there are a number of applications that are memory-bound (and the
above data does not address the cache latencies), the newer i4 and
x86-64 boxes are going to be much better at I/O with the PCIe updates,
and with faster controllers.

It's the application improvements and/or the box size reductions and/or
the hardware maintenance support cost reductions that really matter
here, though. Not generic benchmarks.


--
Pure Personal Opinion | HoffmanLabs LLC

Jan-Erik Soderholm

Sep 5, 2015, 10:49:43 AM

On 2015-09-05 at 16:17, Dirk Munk wrote:
> John Reagan wrote:
>> On Saturday, September 5, 2015 at 5:06:25 AM UTC-4, Jan-Erik Soderholm
>> wrote:
>>> Just happened to come across this:
>>>
>>> http://www.downloads.openvmsmigration.com/2015/openvms-performance-on-i4.pdf
>>>
>>
>> The point that Colin missed in the "i2 to i4 differences" is that Poulson
>> is an "out-of-order" machine. In my opinion, that accounts for much of
>> the performance improvement. In places where GEM didn't get perfect
>> bundling or scheduling, the chip's out-of-order instruction processing
>> finds more performance.
>>
>
> Excellent observation Jan-Erik.

What did *I* "observe" ??? /Jan-Erik.

Jan-Erik Soderholm

Sep 5, 2015, 10:55:20 AM

On 2015-09-05 at 16:24, Stephen Hoffman wrote:
> On 2015-09-05 09:06:29 +0000, Jan-Erik Soderholm said:
>
>> Just happened to come across this:
>>
>> http://www.downloads.openvmsmigration.com/2015/openvms-performance-on-i4.pdf
>
>
> i4 bandwidth ~5 GBps to main memory, compare with AlphaServer GS1280 EV7 at
> ~12.8 GBps.

Is that for a single CPU/socket/process/application? Or is it the summed
bandwidths for all available CPUs against different memory areas?
But still, i4 is better than i2, isn't it?


Stephen Hoffman

Sep 5, 2015, 11:40:54 AM

On 2015-09-05 14:55:23 +0000, Jan-Erik Soderholm said:

> On 2015-09-05 at 16:24, Stephen Hoffman wrote:
>>
>> i4 bandwidth ~5 GBps to main memory, compare with AlphaServer GS1280
>> EV7 at ~12.8 GBps.
>
> Is that for a single CPU/socket/process/application? Or is it the summed
> bandwidths for all available CPUs against different memory areas?

Quoting from the LANL _Performance Evaluation of an EV7 AlphaServer
Machine_ (LA-UR-02-4850) paper, "The two EV7 on-chip RDRAM memory
controllers support a maximum memory-to-L2 transfer rate of 12 GB/s".

> But still, i4 is better than i2, isn't it?

Usually, yes. Particularly in terms of the cited performance. But
sometimes not. As with anything else involving benchmarks, the answer
depends on what you have now, and on what factor(s) you're optimizing
for. Most folks aren't running benchmarks, to cite an aphorism.
You're also acquiring new licenses and new support, and a different
support organization for engineering support, if not for all of your
support. Only a subset of the layered products was available for
V8.4-1H1 from VSI when last I checked, though the VSI folks were
working diligently to make more LPs available.

David Froble

Sep 5, 2015, 5:17:39 PM

An observation: someone else who can't get attributions right.

And a guess: perhaps it was the Alpha designers who corrected it?

David Froble

Sep 5, 2015, 5:23:31 PM

Stephen Hoffman wrote:
> On 2015-09-05 09:06:29 +0000, Jan-Erik Soderholm said:
>
>> Just happened to come across this:
>>
>> http://www.downloads.openvmsmigration.com/2015/openvms-performance-on-i4.pdf
>>
>
>
> i4 bandwidth ~5 GBps to main memory, compare with AlphaServer GS1280 EV7
> at ~12.8 GBps.
>
> i4 latency is between ~100 and ~200 ns, compare EV7 average latency at
> ~260 ns, with a range of 83 ns to 390 ns for the furthest memory in the
> torus.
>
> Processor caches and the associated cache latency and bandwidth and
> cache sizes play a huge factor here for aggregate application
> performance, as main memory is much slower.
>
> If/when the processors and servers arrive, the Kittson improvements will
> be interesting. Links to some of the internal documents and funding
> information are in the Wikipedia Itanium article
> <https://en.wikipedia.org/wiki/Itanium>, for folks interested in a
> refresher around those details.

But, but, ... according to JF, Kittson will just be a Poulson that tests rather
well ....

> For those that want a refresher on Poulson itself, see
> <http://www.realworldtech.com/poulson/>
>
> Did a quick look for Skylake references, and there's very little data
> posted. For DDR4 via a recent top-end i7, looks like ~50 GBps, and
> latencies ~25 ns.
> <http://www.tomshardware.com/reviews/adata-xpg-z1-crucial-ddr4-x99,4007-5.html>
> (This data is single-socket, which avoids the overhead of
> coordinating across multiple sockets. Poulson was released ~3 years
> ago, so direct comparisons with a more recent Broadwell and
> just-released Skylake design are unfair.)

Just as it's unfair to compare, what, 8-10 year old Alpha to just released
whatever? Even when sometimes the Alpha compares favorably?

Back to single socket, huh? What goes around comes around ....

:-)

> While there are a number of applications that are memory-bound (and the
> above data does not address the cache latencies), the newer i4 and
> x86-64 boxes are going to be much better at I/O with the PCIe updates,
> and with faster controllers.
>
> It's the application improvements and/or the box size reductions and/or
> the hardware maintenance support cost reductions that really matter
> here, though. Not generic benchmarks.
>
>

Ayep!