How do I check what frequency the DDR2 is running at on the Pandaboard?
Also, how do I check whether the L1 and L2 caches are enabled, and what
the cache policies are?
There has already been some discussion that might be helpful:
http://groups.google.com/group/pandaboard/browse_thread/thread/66f63260c305f1e8/9b5f2b8df4426067
I've used omap4_emif from http://elinux.org/Board_Bringup_Utilities for this in the past.
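For the cache part of the question: whether the L1 caches are on can be
read from the CP15 System Control Register (SCTLR), which is only
accessible from a privileged context, e.g. a small kernel module. A
minimal sketch (the helper names are just illustrative; bit positions
are from the ARM ARM):

#include <linux/kernel.h>

/*
 * Read the Cortex-A9 SCTLR; bit 2 (C) is the data/unified cache enable,
 * bit 12 (I) is the instruction cache enable.  Kernel mode only.
 * For the outer L2 (PL310), bit 0 of its control register at offset
 * 0x100 from the controller base tells you whether it is on.
 */
static unsigned int read_sctlr(void)
{
        unsigned int val;
        asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r" (val));
        return val;
}

static void report_l1_state(void)
{
        unsigned int sctlr = read_sctlr();
        printk(KERN_INFO "L1 D-cache: %s, L1 I-cache: %s\n",
               (sctlr & (1 << 2))  ? "enabled" : "disabled",
               (sctlr & (1 << 12)) ? "enabled" : "disabled");
}

The cache policy (write-back, write-allocate, etc.) comes from the page
table attributes the kernel sets up, so that part is a kernel
configuration question rather than a single register bit.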
--
regards,
Sergey
> I used omap4_emif to compute the DDR2 clock. It says 200MHz. Is there
> any way I can clock the DDR2 at 400MHz?
That tool prints half the actual clock frequency, so your DDR is being
clocked at 400MHz.
--
Måns Rullgård
ma...@mansr.com
But the STREAM benchmark shows the Pandaboard memory bandwidth is far
below the DDR2-400 peak transfer rate of 3200MB/s:
Pandaboard (A1, ubuntu-11.04, L2-D prefetch by default on)
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 298.7389 0.1077 0.1071 0.1102
Scale: 299.0824 0.1074 0.1070 0.1078
Add: 264.1703 0.1820 0.1817 0.1823
Triad: 433.8945 0.1109 0.1106 0.1112
The Pandaboard's score is half of the Tegra's (DDR2-667), even though the OMAP4 is dual-channel:
-------------------------------------------------------------
Copy: 733.1017 0.0441 0.0437 0.0451
Scale: 643.9154 0.0505 0.0497 0.0531
Add: 815.2425 0.0594 0.0589 0.0600
Triad: 728.9635 0.0666 0.0658 0.0674
So where is the bottleneck? Memory controller? L2 Cache controller? Bus?
-Yi
As you have access to both pieces of hardware, could you please try
this benchmark one more time after compiling it with an extra
-fprefetch-loop-arrays gcc optimization option? I'm just curious and
wonder if it will affect relative Pandaboard/Tegra2 performance in
this particular benchmark.
--
Best regards,
Siarhei Siamashka
The interface between the A9 and the memory controller is, let's be
diplomatic, suboptimal. This is compounded by ROM bugs on the early
OMAP4 revisions (all currently available) preventing optimal cache
configuration. The latter problem will be fixed in an upcoming silicon
revision, the former (allegedly) in the 4460 once it hits the streets.
--
Måns Rullgård
ma...@mansr.com
Is this performance bottleneck related to the Cortex-A9 r1p2 revision in
any way? Or is something suboptimal on the OMAP side (other than the
cache configuration)? Or a bit of both?
And if I want to get some Cortex-A9 based device with the fastest
memory right now, do you have any recommendation which one would be a
better choice?
> 2011/5/25 Måns Rullgård <ma...@mansr.com>:
>> "Li Yi (Adam)" <liyi...@gmail.com> writes:
>>> So where is the bottleneck? Memory controller? L2 Cache controller? Bus?
>>
>> The interface between the A9 and the memory controller is, let's be
>> diplomatic, suboptimal. This is compounded by ROM bugs on the early
>> OMAP4 revisions (all currently available) preventing optimal cache
>> configuration. The latter problem will be fixed in an upcoming silicon
>> revision, the former (allegedly) in the 4460 once it hits the streets.
>
> Is this performance bottleneck related to Cortex-A9 r1p2 revision in
> any way? Or something is suboptimal on the OMAP side (other than cache
> configuration)? Or a bit of both?
The A9 itself is OK, from what I can tell. The problems are somewhere
in the interfacing between the MPU and DMM blocks. The IVAHD seems to
be getting all the bandwidth it needs.
> And if I want to get some Cortex-A9 based device with the fastest
> memory right now, do you have any recommendation which one would be a
> better choice?
I don't know which is the fastest. I don't even know all the A9
chips/boards coming out.
--
Måns Rullgård
ma...@mansr.com
This does not look right to me. For comparison, here are some
results from a Pandaboard EA1 (where the RAM is clocked at just half of
the standard frequency!):
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
Compiling it with gcc 4.5.2 and different optimization options, I'm
getting the following results:
-O2
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 236.6480 0.1361 0.1352 0.1364
Scale: 226.5248 0.1417 0.1413 0.1422
Add: 336.0119 0.1431 0.1429 0.1433
Triad: 338.0328 0.1449 0.1420 0.1644
-O2 -fprefetch-loop-arrays
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 513.5024 0.0645 0.0623 0.0789
Scale: 561.3432 0.0572 0.0570 0.0575
Add: 325.9167 0.1475 0.1473 0.1477
Triad: 331.2698 0.1454 0.1449 0.1468
-O2 -fopenmp
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 249.6628 0.1287 0.1282 0.1293
Scale: 249.9003 0.1284 0.1281 0.1288
Add: 593.5371 0.0813 0.0809 0.0820
Triad: 587.5524 0.0830 0.0817 0.0836
-O2 -fopenmp -fprefetch-loop-arrays
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 250.3189 0.1280 0.1278 0.1282
Scale: 779.6104 0.0413 0.0410 0.0417
Add: 396.1905 0.1223 0.1212 0.1238
Triad: 530.2980 0.0908 0.0905 0.0914
-O2 -fopenmp -fprefetch-loop-arrays -funroll-loops
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 324.4349 0.0992 0.0986 0.0995
Scale: 776.1525 0.0417 0.0412 0.0420
Add: 441.3228 0.1107 0.1088 0.1118
Triad: 431.6358 0.1126 0.1112 0.1136
So depending on gcc optimization options, we get some nice semi-random
benchmark numbers.
> I think the bottleneck is in HW.
Yes, I also think that the bottleneck is in the hardware. But
apparently you also have some additional problems on the software
side, which make performance even worse on your board.
> And if I want to get some Cortex-A9 based device with the fastest
> memory right now, do you have any recommendation which one would be a
> better choice?
>
From my tests using a Tegra250 board (DDR2-667) and a Pandaboard A1
(LPDDR2-400), the Tegra250 obviously has better memory bandwidth (per
the STREAM benchmark).
But since the memory bandwidth on both boards is far below the maximum
data transfer rate of their DDR2 memory chips, I don't think the
difference is caused by the memory chips themselves.
-Yi
I was using the -O0 option and an old version of arm-gcc (gcc
version 4.3.3, Sourcery G++ Lite 2009q1-203).
When I switched to a newer arm-gcc (gcc version 4.5.1,
Sourcery G++ Lite 2010.09-50) and -O2, I got a better score:
arm-none-linux-gnueabi-gcc -O2 -mfloat-abi=softfp -mcpu=cortex-a9
-march=armv7-a -fprefetch-loop-arrays -static stream.c -o stream.exe
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 844.9550 0.0382 0.0379 0.0385
Scale: 763.7029 0.0421 0.0419 0.0423
Add: 419.7671 0.1146 0.1143 0.1149
Triad: 348.7509 0.1380 0.1376 0.1383
-------------------------------------------------------------
-Yi
Yes, peak memory write bandwidth can be estimated by running some tests
with the standard memset function from the C library. It shows some
really impressive performance on OMAP4 (~1.9GB/s on my EA1 board),
especially considering the write-allocate cache policy configured for
the ARM Cortex-A9 in the Linux kernel.
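The kind of test I mean is just a rough one along these lines (buffer
size, iteration count and the timing method are arbitrary choices;
link with -lrt for clock_gettime on older glibc):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
        const size_t size = 64 << 20;   /* 64 MiB, well above L2 size */
        const int iters = 16;
        char *buf = malloc(size);
        struct timespec t0, t1;
        double sec;
        int i;

        if (!buf)
                return 1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++)
                memset(buf, i, size);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("memset: %.1f MB/s\n", (double)size * iters / sec / 1e6);
        free(buf);
        return 0;
}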
By the way, has anyone seen the following article:
http://www.anandtech.com/show/4225/the-ipad-2-review/4 ? The most
interesting information there is the data for "Stdlib Write
(single-threaded scalar)" and "Stdlib Copy (single-threaded scalar)"
on the Apple iPad 2. To me it looks like the same, already familiar
pattern: great memset and poor memcpy performance. That's why I was a
bit worried that the issue is not OMAP-specific, but applies to all
Cortex-A9 processors (maybe just the initial r1pX revisions) regardless
of SoC vendor.
And for comparison, an Intel Atom N450 with DDR2-667 shows ~1.5GB/s
for both memset and memcpy. After making sure that memset
uses MOVNTDQ instructions (so memory writes bypass the cache),
memset performance increases to ~3GB/s there, which gives an idea
of how write-allocate may hurt performance in such use cases.
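To make the MOVNTDQ trick concrete, a cache-bypassing memset on x86
looks roughly like this with SSE2 intrinsics (simplified sketch: dst is
assumed to be 16-byte aligned and len a multiple of 16):

#include <emmintrin.h>
#include <stddef.h>

static void memset_nontemporal(void *dst, int c, size_t len)
{
        __m128i v = _mm_set1_epi8((char)c);
        char *p = dst;
        size_t i;

        for (i = 0; i + 16 <= len; i += 16)
                _mm_stream_si128((__m128i *)(p + i), v);  /* MOVNTDQ */
        _mm_sfence();  /* make the non-temporal stores globally visible */
}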
STREAM result on Tegra250 with "-O2 -fprefetch-loop-arrays":
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 1401.9101 0.0229 0.0228 0.0230
Scale: 1369.6285 0.0236 0.0234 0.0244
Add: 768.0492 0.0627 0.0625 0.0630
Triad: 745.5384 0.0645 0.0644 0.0647
-------------------------------------------------------------
-Yi
> If you try to decrease Tegra2's memory frequency to 333MHz, you will see
> similar BW.
This 333MHz would be the DDR clock frequency, right?
> "And if I want to get some Cortex-A9 based device with the fastest
> memory right now, do you have any recommendation which one would be a
> better choice?"
>
> I didn't try it, but Tegra2 does support an 800MHz memory controller in their
> kernel code, though they said it only supports 667MHz. So you may be able to
> overclock it to 800MHz; the question is whether the memory supports it. The
> Xoom doesn't; it seems the AC100 does, just a guess.
And these figures seem to be talking about data rates, i.e. double the
clock frequency.
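(For example, a 333MHz DDR clock transfers on both edges, i.e. ~667MT/s;
assuming a 32-bit channel, that is roughly 667MT/s x 4 bytes, or about
2.7GB/s of theoretical peak.)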
The memory on the Tegra2 Harmony dev board is rated at tCK=2.5ns (400MHz),
but the last time I looked at it, the kernel would only clock it at 333MHz.
The OMAP4 runs at 400MHz.
--
Måns Rullgård
ma...@mansr.com
> On Fri, May 27, 2011 at 9:13 AM, yi li <liyi...@gmail.com> wrote:
>> On Thu, May 26, 2011 at 4:59 AM, Siarhei Siamashka
>> <siarhei....@gmail.com> wrote:
>>
>>> And if I want to get some Cortex-A9 based device with the fastest
>>> memory right now, do you have any recommendation which one would be a
>>> better choice?
>>>
>>
>> From my test using Tegra250 board (DDR2-667) and Pandaboard-A1
>> (LPDDR2-400), obviously Tegra250 has better memory bandwidth (given
>> STREAM benchmark).
>> But since the memory bandwidth on both boards is far below the maximum
>> data transfer rate of their DDR2 memory chip, I don't think the
>> difference is caused by memory chip itself.
>
> Yes, peak memory bandwidth can be estimated by running some tests with
> the standard memset function from C programming language. It has some
> really impressive performance on OMAP4 (~1.9GB/s on my EA1 board),
The OMAP4 does show very fast writes. I don't know why.
> especially considering write-allocate cache policy configured for ARM
> Cortex-A9 in the linux kernel.
I toyed with turning that off, and it made little difference. The
reason it is there is for proper cache coherency between the two cores.
--
Måns Rullgård
ma...@mansr.com
>> Binwei Yang <binw...@gmail.com> writes:
>>
>> > If you try to decrease Tegra2's memory frequency to 333MHz, you
>> > will see similar BW.
>>
>> This 333MHz would be the DDR clock frequency, right?
>>
>> > "And if I want to get some Cortex-A9 based device with the fastest
>> > memory right now, do you have any recommendation which one would be a
>> > better choice?"
>> >
>> > I didn't try it, but Tegra2 does support an 800MHz memory controller in their
>> > kernel code, though they said it only supports 667MHz. So you may be able to
>> > overclock it to 800MHz; the question is whether the memory supports it. The
>> > Xoom doesn't; it seems the AC100 does, just a guess.
>>
>> And these figures seem to be talking about the data rates, i.e. double
>> clock frequency.
>>
>> The memory on the Tegra2 Harmony dev board is rated at tCK=2.5ns (400MHz),
>> but the last time I looked at it, the kernel would only clock it at 333MHz.
>>
>> The OMAP4 runs at 400MHz.
>>
> Here I mean MT/s throughout. The Xoom runs at 600MT/s, the AC100 at 667MT/s.
> But many documents say the AC100 uses 800MT/s memory; I think that's possible.
> The Linux kernel supplied by NVIDIA does support an 800MHz memory controller,
> so if the AC100 does use 800MT/s memory, you can overclock the memory
> controller there. The OMAP4 uses LPDDR2 400MT/s memory, so it's expected that
> the OMAP4 has higher latency than Tegra2.
The OMAP4 DDR clock goes up to 400MHz or 800MT/s.
--
Måns Rullgård
ma...@mansr.com
> you mean 4460 will use LPDDR2 800MT/s?
No, I mean the 4430 has that. May I suggest you read the datasheets?
--
Måns Rullgård
ma...@mansr.com
<snip>
>> When I switch to a newer version of arm-gcc (gcc version 4.5.1
>> (Sourcery G++ Lite 2010.09-50) ) and -O2, I got better score:
>>
>> arm-none-linux-gnueabi-gcc -O2 -mfloat-abi=softfp -mcpu=cortex-a9
>> -march=armv7-a -fprefetch-loop-arrays -static stream.c -o stream.exe
>>
>> -------------------------------------------------------------
>> Function Rate (MB/s) Avg time Min time Max time
>> Copy: 844.9550 0.0382 0.0379 0.0385
>> Scale: 763.7029 0.0421 0.0419 0.0423
>> Add: 419.7671 0.1146 0.1143 0.1149
>> Triad: 348.7509 0.1380 0.1376 0.1383
>> -------------------------------------------------------------
>> -Yi
>
> STREAM result on Tegra250 with "-O2 -fprefetch-loop-arrays":
> -------------------------------------------------------------
> Function Rate (MB/s) Avg time Min time Max time
> Copy: 1401.9101 0.0229 0.0228 0.0230
> Scale: 1369.6285 0.0236 0.0234 0.0244
> Add: 768.0492 0.0627 0.0625 0.0630
> Triad: 745.5384 0.0645 0.0644 0.0647
> -------------------------------------------------------------
OK, thanks for sharing this information. So until proven otherwise, we
can safely assume that none of the ARM Cortex-A9 based systems has a
usable hardware prefetcher at the moment.
> So until proven otherwise, we can safely assume that none of the ARM
> Cortex-A9 based systems has a usable hardware prefetcher at the
> moment.
What do you mean by hardware prefetcher? The A9+PL310 has several kinds
of prefetching which can be used:
- Dual line fetch, which fetches two cache lines at a time from RAM with
a configurable distance between them.
- Automatic speculative prefetch based on current access patterns. Off
by default.
- Explicit prefetch with PLD instruction.
- L2 preload engine.
Some of these features are unavailable or partially available on current
silicon, but this should improve in the near future.
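To illustrate the explicit-prefetch option: with gcc, __builtin_prefetch()
emits a PLD on ARM, so a copy loop with software prefetch looks roughly
like this (the 256-byte prefetch distance is just a guess and would need
tuning):

#include <stddef.h>
#include <stdint.h>

void copy_with_pld(uint32_t *dst, const uint32_t *src, size_t words)
{
        size_t i;

        for (i = 0; i < words; i++) {
                if (i + 64 < words)
                        __builtin_prefetch(&src[i + 64]); /* 256 bytes ahead */
                dst[i] = src[i];
        }
}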
--
Måns Rullgård
ma...@mansr.com
I mean the generally accepted concept, which exists even outside ARM
and TI/OMAP scope. There are a number of more or less authoritative
sources, and you can easily find them: "Intel(R) 64 and IA-32
Architectures Optimization Reference Manual", "G5 Performance
Programming", Ulrich Drepper's "What Every Programmer Should Know
About Memory", etc.
Basically it's the thing that eliminates or greatly reduces the need
for doing software prefetching in code with easily predictable
memory access patterns (simple sequential access, for example), such
as this STREAM benchmark.
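For reference, the four kernels in stream.c are essentially the following
loops (paraphrased), so all of them are plain sequential sweeps over
large arrays:

for (j = 0; j < N; j++) c[j] = a[j];                  /* Copy  */
for (j = 0; j < N; j++) b[j] = scalar * c[j];         /* Scale */
for (j = 0; j < N; j++) c[j] = a[j] + b[j];           /* Add   */
for (j = 0; j < N; j++) a[j] = b[j] + scalar * c[j];  /* Triad */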
Had you spent a little more time reading and less being condescending
you might have noticed, in the part of my message you cut, that the A9
does have automatic prefetching. It is currently disabled, by mistake,
on the OMAP4 GP devices. On OMAP4 HS, and presumably other A9-based
chips, it is available.
--
Måns Rullgård
ma...@mansr.com
Thank you, KO. Had you spent a little more time reading what I have
written in my message, you might have noticed the emphasis on "usable"
and "at the moment". The information from Li Yi with the benchmark
data from Tegra2 was actually useful, and provides a hint that
hardware prefetch is also not enabled there "by mistake" or for
whatever other reason. And a part of my older message, which you also
decided to cut and avoid commenting on, contained a link to some
benchmarks indicating that the iPad 2 (presumably ARM Cortex-A9
based) does not have perfect memory performance either.
Maybe the kernel is simply not enabling it. It is off on reset. It
appears that you have read neither the manuals nor the kernel source,
just making bold statements in a rude tone as always.
> And a part of my older message that you also decided to cut and avoid
> commenting, contained the link to some benchmarks which indicated that
> iPad 2 (presumably ARM Cortex-A9 based) does not have perfect memory
> performance either.
That is the least of the problems with Apple products.
--
Måns Rullgård
ma...@mansr.com
Maybe. I would be very interested to see somebody try it and confirm
that it is actually working by posting relevant benchmark results.
> It appears that you have read neither the manuals nor the kernel source,
> just making bold statements in a rude tone as always.
My "bold" statement basically boils down to "seeing is believing".
What's wrong about it?
From the CTT tool, I see the EMIF running at 400MHz and the DDR PHY running at 200MHz.
There are prefetchers at the a9mpcore and PL310 level. If you turn them on (ES2.2 or 2.3 GP, or any EMU device with the PPA patch) you will see benchmark improvements. If you just look at memcpy they are positive, but there is much more to workloads than >L2-size memcpy's, so YMMV.
There are errata scattered around on both, and what you can safely use does depend on which A9 and PL310 revisions you are using.
The 4460 does have newer versions of both. The newer PL310s also offer more knobs (like double line fill). Samples do show a good improvement, as expected. It will probably be a while before others can get at them to verify.
The experiments on a GP 2.1 Panda are limited because security blocks access to the registers. Other phones or tablets should be around soon for expanded trials. The PlayBook is out there with a newer silicon revision, but that is not Linux.
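For anyone wanting to poke at this, the PL310 knobs above map to the
following bits (per the PL310 TRM; the macro names just follow the Linux
l2x0 naming and are only here to make the discussion concrete):

#define L310_AUX_CTRL_DATA_PREFETCH       (1u << 28)  /* Aux Control */
#define L310_AUX_CTRL_INSTR_PREFETCH      (1u << 29)  /* Aux Control */
#define L310_PREFETCH_CTRL_DBL_LINEFILL   (1u << 30)  /* Prefetch Control,
                                                         newer PL310 revs */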
Regards,
Richard W.
>>Maybe. I would be very interested to see somebody try it and confirm
>>that it is actually working by posting relevant benchmark results.
>>
>>> It appears that you have read neither the manuals nor the kernel source,
>>> just making bold statements in a rude tone as always.
>
> There are prefetchers at a9mpcore and PL310 level. If you turn them
> on (ES2.2 or 2.3 GP or any EMU device with PPA patch) you will see
> benchmark improvements. If you just look at memcpy they are positive.
> There is much more to work loads then >L2 size memcpy's so YMWV.
Is the ROM fix enabling full control of the A9 aux control and PL310
registers in ES2.2? Someone said it wouldn't be until ES2.3, and I've
heard nothing authoritative.
--
Måns Rullgård
ma...@mansr.com
It should be in 2.2 GP based on the old plan; I didn't check post-release. I've been using HS for a bit, so I haven't tried it on GP first-hand.
PL310 Aux: R12=109, R0=value
PL310 POR: R12=113, R0=value
The 2.3 fix list I know of doesn't list these, so I assume it made 2.2.
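For reference, that call convention (service index in r12, value in r0,
then an SMC) can be open-coded from the kernel roughly like this. I'm
assuming the indices above are hex, i.e. 0x109 and 0x113, and mainline
already has an omap_smc1() helper doing much the same thing:

/* Kernel context only; the secure monitor may clobber registers, hence
 * the broad clobber list.  Needs a toolchain that accepts "smc". */
static void omap4_secure_write(unsigned long index, unsigned long value)
{
        register unsigned long r0 asm("r0") = value;
        register unsigned long r12 asm("r12") = index;

        asm volatile(".arch_extension sec\n\t"
                     "dsb\n\t"
                     "smc #0\n\t"
                     : "+r" (r0)
                     : "r" (r12)
                     : "r1", "r2", "r3", "lr", "memory", "cc");
}

/* e.g. omap4_secure_write(0x109, aux_ctrl);  PL310 Aux Control
 *      omap4_secure_write(0x113, prefetch);  PL310 Prefetch (POR)  */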
Regards,
Richard W.
FWIW, the latest TRM revision says it's available from 2.2.
--
Måns Rullgård
ma...@mansr.com