Cache and DDR settings


Harish Mahendrakar

May 23, 2011, 1:46:10 AM
to pandaboard
How do I check what frequency the DDR2 is running at on the Pandaboard? And
also, how do I check whether the L1 and L2 caches are enabled and what the
cache policies are?
I searched the net and could not find any info anywhere.
Any help would be greatly appreciated.
I am using Ubuntu 11.04.
Thanks and Regards,
Harish Mahendrakar

Stehle, Vincent

May 24, 2011, 6:04:13 AM
to panda...@googlegroups.com
On Mon, May 23, 2011 at 7:46 AM, Harish Mahendrakar <hari...@gmail.com> wrote:
> How do i check at what frequency DDR2 is running at in pandaboard?

One way is to peek at the PRCM register values using e.g. devmem2 and compute the resulting frequencies with the TRM at hand.
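
What a devmem2-style register read does can be sketched roughly as follows. This is only an illustration under assumptions: the helper names split_address and read_reg32 are made up for this sketch, no real PRCM address is given here, and the register addresses and the formula that turns the DPLL fields into a frequency have to come from the TRM. It needs root and a kernel that allows /dev/mem access.

```python
import mmap
import os
import struct

PAGE_SIZE = mmap.PAGESIZE


def split_address(phys_addr, page_size=PAGE_SIZE):
    """Return (page-aligned base, offset inside that page) for a physical address."""
    base = phys_addr & ~(page_size - 1)
    return base, phys_addr - base


def read_reg32(phys_addr):
    """Read one 32-bit little-endian word through /dev/mem (root only).

    Caveat: struct copies the bytes out of the mapping, so this is not
    guaranteed to be a single 32-bit bus access the way a C volatile
    read would be.
    """
    base, offset = split_address(phys_addr)
    fd = os.open("/dev/mem", os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, PAGE_SIZE, mmap.MAP_SHARED, mmap.PROT_READ,
                       offset=base) as mem:
            return struct.unpack_from("<I", mem, offset)[0]
    finally:
        os.close(fd)
```

On the board one would call read_reg32() with the physical address of the clock register taken from the TRM and then decode the multiplier and divider fields by hand.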
 
> And also how do i check if L1 and L2 caches are enabled

You could use lmbench to measure that, for example with lat_mem_rd. The performance measurements should reflect the L1/L2/DDR hierarchy.
 
> and what are cache policies?

You may want to look at kernel code / CP15 registers for this one.

Best regards,

V.

Sergiy Kibrik

May 24, 2011, 6:11:05 AM
to panda...@googlegroups.com, Stehle, Vincent
On 05/24/2011 01:04 PM, Stehle, Vincent wrote:

> On Mon, May 23, 2011 at 7:46 AM, Harish Mahendrakar <hari...@gmail.com <mailto:hari...@gmail.com>> wrote:
>
> How do i check at what frequency DDR2 is running at in pandaboard?
>
>
> One way is to peek at PRCM registers values using e.g. devmem2 and compute the resulting frequencies with the TRM at hand.

There has already been some discussion that might be helpful:
http://groups.google.com/group/pandaboard/browse_thread/thread/66f63260c305f1e8/9b5f2b8df4426067

I've used omap4_emif from http://elinux.org/Board_Bringup_Utilities at some point.

--
regards,
Sergey

Harish Mahendrakar

May 25, 2011, 2:13:08 AM
to pandaboard
Thanks Vincent,
Will look into CP15 registers or kernel code for cache policy info.

Thanks Sergey,
I used omap4_emif to compute the DDR2 clock. It says 200MHz. Is there
any way I can clock the DDR2 at 400MHz?
Regards,
Harish


On May 24, 3:11 pm, Sergiy Kibrik <sergiy.kib...@globallogic.com>
wrote:
> On 05/24/2011 01:04 PM, Stehle, Vincent wrote:
>
> > On Mon, May 23, 2011 at 7:46 AM, Harish Mahendrakar <haris...@gmail.com <mailto:haris...@gmail.com>> wrote:
>
> >     How do i check at what frequency DDR2 is running at in pandaboard?
>
> > One way is to peek at PRCM registers values using e.g. devmem2 and compute the resulting frequencies with the TRM at hand.
>
> there've already been some discussion that might be helpful: http://groups.google.com/group/pandaboard/browse_thread/thread/66f632...
>
> I've used omap4_emif from http://elinux.org/Board_Bringup_Utilities sometime
>
> --
> regards,
> Sergey

Måns Rullgård

May 25, 2011, 5:59:04 AM
to panda...@googlegroups.com
Harish Mahendrakar <hari...@gmail.com> writes:

> I used omap4_emif to compute DDR2 clock. It says 200MHz. Is there
> anyway i can clock the DDR2 at 400MHz?

That tool prints half the actual clock frequency, so your DDR is being
clocked at 400MHz.
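
The relation between the tool's reading, the DDR clock and the transfer rate can be written down as a small sketch. The function name is invented for illustration; the factor of two for the tool comes from the statement above, and the second factor of two is simply the double data rate (two transfers per clock cycle).

```python
def ddr_rates_from_tool(reported_mhz):
    """Turn the omap4_emif reading into (DDR clock in MHz, data rate in MT/s)."""
    clock_mhz = 2 * reported_mhz    # the tool prints half the actual clock
    data_rate_mt_s = 2 * clock_mhz  # DDR transfers data on both clock edges
    return clock_mhz, data_rate_mt_s


clock, rate = ddr_rates_from_tool(200)
print("DDR clock %d MHz, data rate %d MT/s" % (clock, rate))
```

For the 200MHz reading this gives a 400MHz clock and 800MT/s.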

--
Måns Rullgård
ma...@mansr.com

Li Yi (Adam)

May 25, 2011, 8:24:32 AM
to panda...@googlegroups.com
2011/5/25 Måns Rullgård <ma...@mansr.com>

>
> Harish Mahendrakar <hari...@gmail.com> writes:
>
> > I used omap4_emif to compute DDR2 clock. It says 200MHz. Is there
> > anyway i can clock the DDR2 at 400MHz?
>
> That tool prints half the actual clock frequency, so your DDR is being
> clocked at 400MHz.
>

But the STREAM benchmark shows the Pandaboard memory bandwidth is far
below the DDR2-400 peak transfer rate of 3200MB/s:

Pandaboard (A1, ubuntu-11.04, L2-D prefetch by default on)
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         298.7389       0.1077       0.1071       0.1102
Scale:        299.0824       0.1074       0.1070       0.1078
Add:          264.1703       0.1820       0.1817       0.1823
Triad:        433.8945       0.1109       0.1106       0.1112

Pandaboard's score is half of Tegra (DDR2-667), although OMAP4 is dual-channel.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         733.1017       0.0441       0.0437       0.0451
Scale:        643.9154       0.0505       0.0497       0.0531
Add:          815.2425       0.0594       0.0589       0.0600
Triad:        728.9635       0.0666       0.0658       0.0674

So where is the bottleneck? Memory controller? L2 Cache controller? Bus?
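
To put the gap in numbers, here is a small sketch that expresses the rates above as a fraction of the quoted 3200MB/s peak and as a Pandaboard/Tegra ratio. The rates are copied from the two tables; the variable and function names are only illustrative.

```python
PEAK_MB_S = 3200.0  # peak figure quoted above

panda = {"Copy": 298.7389, "Scale": 299.0824, "Add": 264.1703, "Triad": 433.8945}
tegra = {"Copy": 733.1017, "Scale": 643.9154, "Add": 815.2425, "Triad": 728.9635}


def fraction_of_peak(rate_mb_s, peak_mb_s=PEAK_MB_S):
    """Measured rate as a fraction of the theoretical peak."""
    return rate_mb_s / peak_mb_s


for kernel in panda:
    print("%-6s %5.1f%% of peak, %.2fx of Tegra" % (
        kernel,
        100.0 * fraction_of_peak(panda[kernel]),
        panda[kernel] / tegra[kernel]))
```

Copy, for example, comes out at roughly 9% of the quoted peak and about 0.41 times the Tegra figure.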

-Yi

Siarhei Siamashka

May 25, 2011, 4:31:04 PM
to panda...@googlegroups.com
On Wed, May 25, 2011 at 3:24 PM, Li Yi (Adam) <liyi...@gmail.com> wrote:
> But the STREAM benchmark shows the Pandaboard memory  bandwidth is far
> bellow DDR2-400 peak transfer rate: 3200MB/s:
>
> Pandaboard (A1, ubuntu-11.04, L2-D prefetch by default on)
> -------------------------------------------------------------
> Function      Rate (MB/s)   Avg time     Min time     Max time
> Copy:         298.7389       0.1077       0.1071       0.1102
> Scale:        299.0824       0.1074       0.1070       0.1078
> Add:          264.1703       0.1820       0.1817       0.1823
> Triad:        433.8945       0.1109       0.1106       0.1112
>
> Pandaboard's score is half of Tegra (DDR2-667), although OMAP4 is dual-channel.
> -------------------------------------------------------------
> Copy:         733.1017       0.0441       0.0437       0.0451
> Scale:        643.9154       0.0505       0.0497       0.0531
> Add:          815.2425       0.0594       0.0589       0.0600
> Triad:        728.9635       0.0666       0.0658       0.0674

As you have access to both pieces of hardware, could you please try
this benchmark one more time after compiling it with an extra
-fprefetch-loop-arrays gcc optimization option? I'm just curious and
wonder if it will affect relative Pandaboard/Tegra2 performance in
this particular benchmark.

--
Best regards,
Siarhei Siamashka

Måns Rullgård

May 25, 2011, 4:39:30 PM
to panda...@googlegroups.com

The interface between the A9 and the memory controller is, let's be
diplomatic, suboptimal. This is compounded by ROM bugs on the early
OMAP4 revisions (all currently available) preventing optimal cache
configuration. The latter problem will be fixed in an upcoming silicon
revision, the former (allegedly) in the 4460 once it hits the streets.

--
Måns Rullgård
ma...@mansr.com

Siarhei Siamashka

May 25, 2011, 4:59:26 PM
to panda...@googlegroups.com
2011/5/25 Måns Rullgård <ma...@mansr.com>:

Is this performance bottleneck related to the Cortex-A9 r1p2 revision in
any way? Or is something suboptimal on the OMAP side (other than cache
configuration)? Or a bit of both?

And if I want to get some Cortex-A9 based device with the fastest
memory right now, do you have any recommendation which one would be a
better choice?

Måns Rullgård

May 25, 2011, 5:22:25 PM
to panda...@googlegroups.com
Siarhei Siamashka <siarhei....@gmail.com> writes:

> 2011/5/25 Måns Rullgård <ma...@mansr.com>:
>> "Li Yi (Adam)" <liyi...@gmail.com> writes:
>>> So where is the bottleneck? Memory controller? L2 Cache controller? Bus?
>>
>> The interface between the A9 and the memory controller is, let's be
>> diplomatic, suboptimal.  This is compounded by ROM bugs on the early
>> OMAP4 revisions (all currently available) preventing optimal cache
>> configuration.  The latter problem will be fixed in an upcoming silicon
>> revision, the former (allegedly) in the 4460 once it hits the streets.
>
> Is this performance bottleneck related to Cortex-A9 r1p2 revision in
> any way? Or something is suboptimal on the OMAP side (other than cache
> configuration)? Or a bit of both?

The A9 itself is OK, from what I can tell. The problems are somewhere
in the interfacing between the MPU and DMM blocks. The IVAHD seems to
be getting all the bandwidth it needs.

> And if I want to get some Cortex-A9 based device with the fastest
> memory right now, do you have any recommendation which one would be a
> better choice?

I don't know which is the fastest. I don't even know all the A9
chips/boards coming out.

--
Måns Rullgård
ma...@mansr.com

Binwei Yang

May 27, 2011, 12:13:19 AM
to panda...@googlegroups.com

If you decrease Tegra2's memory frequency to 333MHz, you will see similar BW.

"And if I want to get some Cortex-A9 based device with the fastest
memory right now, do you have any recommendation which one would be a
better choice?"

I didn't try it, but judging from their kernel code Tegra2 does support an 800MHz memory controller, though they say it only supports 667MHz. So you may be able to overclock it to 800MHz; the question is whether the memory supports it. The Xoom's doesn't; it seems the AC100's does, but that's just a guess.



Binwei Yang

May 27, 2011, 12:20:55 AM
to panda...@googlegroups.com

But Tegra2 only uses 1 master port. So it's expected that if TI fixes the ROM issue, the BW should be higher than Tegra2's.

yi li

May 27, 2011, 12:51:49 AM
to panda...@googlegroups.com
2011/5/26 Måns Rullgård <ma...@mansr.com>:

>
> The A9 itself is OK, from what I can tell.  The problems are somewhere
> in the interfacing between the MPU and DMM blocks.  The IVAHD seems to
> be getting all the bandwidth it needs.
>
Thanks Måns for the information. But do you know of any official source
from TI for this HW issue?
I think this is critical for applications that require high memory bandwidth.
-Yi

yi li

May 27, 2011, 1:30:37 AM
to panda...@googlegroups.com
Please see the result here on Pandaboard A1, using the
-fprefetch-loop-arrays option and a 2M array size:

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         294.0476       0.1092       0.1088       0.1096
Scale:        295.5407       0.1088       0.1083       0.1095
Add:          261.4465       0.1845       0.1836       0.1858
Triad:        427.6429       0.1125       0.1122       0.1127
-------------------------------------------------------------
I think the bottleneck is in HW.
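
For reference, the rate column can be reproduced from the minimum time. The sketch below assumes 2,000,000 elements of 8-byte doubles (reading "2M array size" that way is an assumption) and uses STREAM's own counting: Copy and Scale touch two arrays, Add and Triad touch three, and 1 MB is 10^6 bytes. Extra traffic from write-allocate line fills is not counted by STREAM, so the real bus traffic may be higher than the reported rate.

```python
BYTES_PER_ELEMENT = 8  # STREAM works on doubles by default
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel, n_elements, min_time_s):
    """Rate in MB/s the way STREAM reports it: bytes moved / best time / 1e6."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * n_elements
    return bytes_moved / min_time_s / 1e6


N = 2000000  # assumed array size
for kernel, min_time in [("Copy", 0.1088), ("Scale", 0.1083),
                         ("Add", 0.1836), ("Triad", 0.1122)]:
    print("%-6s %9.4f MB/s" % (kernel, stream_rate_mb_s(kernel, N, min_time)))
```

The results agree with the table to within the rounding of the printed times.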

Siarhei Siamashka

May 27, 2011, 2:09:15 AM
to panda...@googlegroups.com

This does not look right to me. For comparison, here are some
results from a Pandaboard EA1 (RAM is clocked at just half of the
standard frequency!)

wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c

Compiling it with gcc 4.5.2 and different optimization options, I'm
getting the following results:

-O2

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         236.6480       0.1361       0.1352       0.1364
Scale:        226.5248       0.1417       0.1413       0.1422
Add:          336.0119       0.1431       0.1429       0.1433
Triad:        338.0328       0.1449       0.1420       0.1644

-O2 -fprefetch-loop-arrays

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         513.5024       0.0645       0.0623       0.0789
Scale:        561.3432       0.0572       0.0570       0.0575
Add:          325.9167       0.1475       0.1473       0.1477
Triad:        331.2698       0.1454       0.1449       0.1468

-O2 -fopenmp

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         249.6628       0.1287       0.1282       0.1293
Scale:        249.9003       0.1284       0.1281       0.1288
Add:          593.5371       0.0813       0.0809       0.0820
Triad:        587.5524       0.0830       0.0817       0.0836

-O2 -fopenmp -fprefetch-loop-arrays

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         250.3189       0.1280       0.1278       0.1282
Scale:        779.6104       0.0413       0.0410       0.0417
Add:          396.1905       0.1223       0.1212       0.1238
Triad:        530.2980       0.0908       0.0905       0.0914

-O2 -fopenmp -fprefetch-loop-arrays -funroll-loops

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         324.4349       0.0992       0.0986       0.0995
Scale:        776.1525       0.0417       0.0412       0.0420
Add:          441.3228       0.1107       0.1088       0.1118
Triad:        431.6358       0.1126       0.1112       0.1136

So depending on gcc optimization options, we get some nice semi-random
benchmark numbers.

> I think the bottleneck is in HW.

Yes, I also think that the bottleneck is in the hardware. But
apparently you also have some additional problems on the software
side, which make performance even worse on your board.

yi li

May 27, 2011, 2:13:19 AM
to panda...@googlegroups.com
On Thu, May 26, 2011 at 4:59 AM, Siarhei Siamashka
<siarhei....@gmail.com> wrote:

> And if I want to get some Cortex-A9 based device with the fastest
> memory right now, do you have any recommendation which one would be a
> better choice?
>

From my test using a Tegra250 board (DDR2-667) and a Pandaboard-A1
(LPDDR2-400), the Tegra250 obviously has better memory bandwidth (going by
the STREAM benchmark).
But since the memory bandwidth on both boards is far below the maximum
data transfer rate of their DDR2 memory chips, I don't think the
difference is caused by the memory chip itself.

-Yi

yi li

May 27, 2011, 2:25:19 AM
to panda...@googlegroups.com
On Fri, May 27, 2011 at 2:09 PM, Siarhei Siamashka
Siarhei, you are right.

I was using -O0 option, and using an old version of arm-gcc (gcc
version 4.3.3 (Sourcery G++ Lite 2009q1-203)).
When I switch to a newer version of arm-gcc (gcc version 4.5.1
(Sourcery G++ Lite 2010.09-50) ) and -O2, I got better score:

arm-none-linux-gnueabi-gcc -O2 -mfloat-abi=softfp -mcpu=cortex-a9
-march=armv7-a -fprefetch-loop-arrays -static stream.c -o stream.exe

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:         844.9550       0.0382       0.0379       0.0385
Scale:        763.7029       0.0421       0.0419       0.0423
Add:          419.7671       0.1146       0.1143       0.1149
Triad:        348.7509       0.1380       0.1376       0.1383
-------------------------------------------------------------
-Yi

Siarhei Siamashka

May 27, 2011, 2:41:16 AM
to panda...@googlegroups.com

Yes, peak memory bandwidth can be estimated by running some tests with
the standard memset function from the C library. It has some
really impressive performance on OMAP4 (~1.9GB/s on my EA1 board),
especially considering the write-allocate cache policy configured for ARM
Cortex-A9 in the linux kernel.

By the way, has anyone seen the following article:
http://www.anandtech.com/show/4225/the-ipad-2-review/4 ? The most
interesting information there is the data for "Stdlib Write
(single-threaded scalar)" and "Stdlib Copy (single-threaded scalar)"
on the Apple iPad 2. To me it looks like the same, already familiar
pattern: great memset and poor memcpy performance. That's why I was a bit
worried that the issue is not OMAP specific, but applies to all
Cortex-A9 processors (maybe just the initial r1pX revision) regardless
of SoC vendor.


And for comparison, Intel Atom N450 with DDR2-667 shows ~1.5GB/s
performance for both memset and memcpy. After making sure that memset
is using MOVNTDQ instructions (memory writes are bypassing cache),
memset performance increases to ~3GB/s there, which gives an idea
about how write-allocate may hurt performance on such use cases.
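
A memset throughput test of the kind mentioned above could look roughly like this. It is a sketch, not the program used for the figures in this thread: it calls the C library memset through Python's ctypes, and the 64 MiB buffer, the repeat count and the fill values are arbitrary choices. The buffer just has to be much larger than L2 so that the writes actually reach RAM.

```python
import ctypes
import time


def memset_bandwidth_mb_s(size_bytes=64 * 1024 * 1024, repeats=5):
    """Best observed memset rate in MB/s (1 MB = 10**6 bytes)."""
    buf = ctypes.create_string_buffer(size_bytes)
    # Touch every page once so page faults are not part of the timing.
    ctypes.memset(buf, 0xA5, size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        ctypes.memset(buf, 0x5A, size_bytes)
        best = min(best, time.perf_counter() - start)
    return size_bytes / best / 1e6
```

Calling memset_bandwidth_mb_s() and printing the result gives a single number to compare between boards; taking the best of several runs is the same idea as STREAM's "Min time" column.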

yi li

May 27, 2011, 3:47:02 AM
to panda...@googlegroups.com

STREAM result on Tegra250 with "-O2 -fprefetch-loop-arrays":


-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1401.9101       0.0229       0.0228       0.0230
Scale:       1369.6285       0.0236       0.0234       0.0244
Add:          768.0492       0.0627       0.0625       0.0630
Triad:        745.5384       0.0645       0.0644       0.0647
-------------------------------------------------------------
-Yi

Måns Rullgård

May 27, 2011, 8:15:53 AM
to panda...@googlegroups.com
Binwei Yang <binw...@gmail.com> writes:

> If you try to decrease tegra2's memory frequency to 333MHz, you will see the
> similar BW.

This 333MHz would be the DDR clock frequency, right?

> "And if I want to get some Cortex-A9 based device with the fastest
> memory right now, do you have any recommendation which one would be a
> better choice?"
>
> I didn't try but tegra2 does support 800MHz memory controller from their
> kernel code thought they said it only support 667MHz. So you may over clock
> it to 800MHz, the problem is whether memory support. Xoom doesn't, it seems
> AC100 does, just guess.

And these figures seem to be talking about the data rates, i.e. double
clock frequency.

The memory on the Tegra2 Harmony dev board is rated at tCK=2.5ns (400MHz),
but the last time I looked at it, the kernel would only clock it at 333MHz.

The OMAP4 runs at 400MHz.

--
Måns Rullgård
ma...@mansr.com

Måns Rullgård

May 27, 2011, 8:20:06 AM
to panda...@googlegroups.com
Siarhei Siamashka <siarhei....@gmail.com> writes:

> On Fri, May 27, 2011 at 9:13 AM, yi li <liyi...@gmail.com> wrote:
>> On Thu, May 26, 2011 at 4:59 AM, Siarhei Siamashka
>> <siarhei....@gmail.com> wrote:
>>
>>> And if I want to get some Cortex-A9 based device with the fastest
>>> memory right now, do you have any recommendation which one would be a
>>> better choice?
>>>
>>
>> From my test using Tegra250 board (DDR2-667) and Pandaboard-A1
>> (LPDDR2-400), obviously Tegra250 has better memory bandwidth (given
>> STREAM benchmark).
>> But since the memory bandwidth on both boards are far bellow maximum
>> data transfer rate of their DDR2 memory chip, I don't think the
>> difference is caused by memory chip itself.
>
> Yes, peak memory bandwidth can be estimated by running some tests with
> the standard memset function from C programming language. It has some
> really impressive performance on OMAP4 (~1.9GB/s on my EA1 board),

The OMAP4 does show very fast writes. I don't know why.

> especially considering write-allocate cache policy configured for ARM
> Cortex-A9 in the linux kernel.

I toyed with turning that off, and it made little difference. The
reason it is there is for proper cache coherency between the two cores.

--
Måns Rullgård
ma...@mansr.com

Binwei Yang

May 27, 2011, 11:38:08 AM
to panda...@googlegroups.com

Here I mean MT/s throughout. The Xoom runs at 600MT/s and the AC100 at 667MT/s, but many documents say the AC100 uses 800MT/s memory, which I think is possible. The Linux kernel supplied by NVIDIA does support an 800MHz memory controller, so if the AC100 really uses 800MT/s memory, you can overclock the memory controller there. OMAP4 uses LPDDR2 400MT/s memory, so it's expected that OMAP4 has longer latency than Tegra2.

On both platforms, throughput is currently limited by the number of outstanding requests, which is set by ARM's design rather than by the memory controller. Tegra2 suffers more because address filtering is enabled on the PL310, so its BW is limited there; they have to, since that's their HW design. If you filter all requests to master 0 on Tegra2, you will be very surprised by the result.

I expect that once the OMAP4 ROM issue is fixed, it should get higher BW from the dual core. Does anyone have 4460 data? If OMAP used 600MT/s memory, Tegra2 would have no way to catch up, ignoring power.



Måns Rullgård

May 27, 2011, 11:50:14 AM
to panda...@googlegroups.com
Binwei Yang <binw...@gmail.com> writes:

>> Binwei Yang <binw...@gmail.com> writes:
>>
>> > If you try to decrease tegra2's memory frequency to 333MHz, you
>> > will see the similar BW.
>>
>> This 333MHz would be the DDR clock frequency, right?
>>
>> > "And if I want to get some Cortex-A9 based device with the fastest
>> > memory right now, do you have any recommendation which one would be a
>> > better choice?"
>> >
>> > I didn't try but tegra2 does support 800MHz memory controller from their
>> > kernel code thought they said it only support 667MHz. So you may over clock
>> > it to 800MHz, the problem is whether memory support. Xoom doesn't, it seems
>> > AC100 does, just guess.
>>
>> And these figures seem to be talking about the data rates, i.e. double
>> clock frequency.
>>
>> The memory on the Tegra2 Harmony dev board is rated at tCK=2.5ns (400MHz),
>> but the last time I looked at it, the kernel would only clock it at 333MHz.
>>
>> The OMAP4 runs at 400MHz.
>>

> here I all means MT/s. xoom run at 600MT/s, AC100 run at 667MT/s. But many
> documents said AC100 uses 800MT/s memory. I think it's possible. in linux
> kernel supplied by nvidia, they does use support 800MHz memory contoller, so
> if AC100 does uses 800MT/s memory, you can overclock memory controller
> there. OMAP4 uses LPDDR2 400MT/s memory. So it's expected OMAP4 has longer
> latency than tegra2.

The OMAP4 DDR clock goes up to 400MHz or 800MT/s.

--
Måns Rullgård
ma...@mansr.com

Binwei Yang

May 27, 2011, 11:54:08 AM
to panda...@googlegroups.com

You mean the 4460 will use LPDDR2 at 800MT/s?



Måns Rullgård

May 27, 2011, 12:42:22 PM
to panda...@googlegroups.com
Binwei Yang <binw...@gmail.com> writes:

> you mean 4460 will use LPDDR2 800MT/s?

No, I mean the 4430 has that. May I suggest you read the datasheets?

--
Måns Rullgård
ma...@mansr.com

Siarhei Siamashka

May 27, 2011, 1:10:31 PM
to panda...@googlegroups.com
On Fri, May 27, 2011 at 10:47 AM, yi li <liyi...@gmail.com> wrote:
> On Fri, May 27, 2011 at 2:25 PM, yi li <liyi...@gmail.com> wrote:
>> On Fri, May 27, 2011 at 2:09 PM, Siarhei Siamashka
>> <siarhei....@gmail.com> wrote:
>>> On Fri, May 27, 2011 at 8:30 AM, yi li <liyi...@gmail.com> wrote:
>>>> On Thu, May 26, 2011 at 4:31 AM, Siarhei Siamashka
>>>> <siarhei....@gmail.com> wrote:
>>>>> On Wed, May 25, 2011 at 3:24 PM, Li Yi (Adam) <liyi...@gmail.com> wrote:
>>>>>> But the STREAM benchmark shows the Pandaboard memory  bandwidth is far
>>>>>> bellow DDR2-400 peak transfer rate: 3200MB/s:
>>>>>>
>>>>>> Pandaboard (A1, ubuntu-11.04, L2-D prefetch by default on)
>>>>>> -------------------------------------------------------------
>>>>>> Function      Rate (MB/s)   Avg time     Min time     Max time
>>>>>> Copy:         298.7389       0.1077       0.1071       0.1102
>>>>>> Scale:        299.0824       0.1074       0.1070       0.1078
>>>>>> Add:          264.1703       0.1820       0.1817       0.1823
>>>>>> Triad:        433.8945       0.1109       0.1106       0.1112
>>>>>>
>>>>>> Pandaboard's score is half of Tegra (DDR2-667), although OMAP4 is dual-channel.
>>>>>> -------------------------------------------------------------
>>>>>> Copy:         733.1017       0.0441       0.0437       0.0451
>>>>>> Scale:        643.9154       0.0505       0.0497       0.0531
>>>>>> Add:          815.2425       0.0594       0.0589       0.0600
>>>>>> Triad:        728.9635       0.0666       0.0658       0.0674

<snip>

>> When I switch to a newer version of arm-gcc (gcc version 4.5.1
>> (Sourcery G++ Lite 2010.09-50) )  and -O2, I got better score:
>>
>> arm-none-linux-gnueabi-gcc -O2 -mfloat-abi=softfp  -mcpu=cortex-a9
>> -march=armv7-a  -fprefetch-loop-arrays -static  stream.c -o stream.exe
>>
>> -------------------------------------------------------------
>> Function      Rate (MB/s)   Avg time     Min time     Max time
>> Copy:         844.9550       0.0382       0.0379       0.0385
>> Scale:        763.7029       0.0421       0.0419       0.0423
>> Add:          419.7671       0.1146       0.1143       0.1149
>> Triad:        348.7509       0.1380       0.1376       0.1383
>> -------------------------------------------------------------
>> -Yi
>
> STREAM result on Tegra250 with "-O2 -fprefetch-loop-arrays":
> -------------------------------------------------------------
> Function      Rate (MB/s)   Avg time     Min time     Max time
> Copy:        1401.9101       0.0229       0.0228       0.0230
> Scale:       1369.6285       0.0236       0.0234       0.0244
> Add:          768.0492       0.0627       0.0625       0.0630
> Triad:        745.5384       0.0645       0.0644       0.0647
> -------------------------------------------------------------

OK, thanks for sharing this information. So until proven otherwise, we
can safely assume that none of the ARM Cortex-A9 based systems has a
usable hardware prefetcher at the moment.

Måns Rullgård

May 27, 2011, 1:30:58 PM
to panda...@googlegroups.com
Siarhei Siamashka <siarhei....@gmail.com> writes:

> So until proven otherwise, we can safely assume that none of the ARM
> Cortex-A9 based systems has a usable hardware prefetcher at the
> moment.

What do you mean by hardware prefetcher? The A9+PL310 has several kinds
of prefetching which can be used:

- Dual line fetch, which fetches two cache lines at a time from RAM with
a configurable distance between them.
- Automatic speculative prefetch based on current access patterns. Off
by default.
- Explicit prefetch with PLD instruction.
- L2 preload engine.

Some of these features are unavailable or partially available on current
silicon, but this should improve in the near future.

--
Måns Rullgård
ma...@mansr.com

Binwei Yang

May 27, 2011, 2:53:06 PM
to panda...@googlegroups.com

From the CTT tool, I see the EMIF running at 400MHz and the DDRPHY at 200MHz. Tegra2's EMC runs at 667MHz or 600MHz. Do you mean the OMAP4430 has a higher memory frequency than Tegra2? Can anyone from TI confirm that? It's the first time I've heard this.



Siarhei Siamashka

May 29, 2011, 9:56:58 AM
to panda...@googlegroups.com
2011/5/27 Måns Rullgård <ma...@mansr.com>:

> Siarhei Siamashka <siarhei....@gmail.com> writes:
>
>> So until proven otherwise, we can safely assume that none of the ARM
>> Cortex-A9 based systems has a usable hardware prefetcher at the
>> moment.
>
> What do you mean by hardware prefetcher?

I mean the generally accepted concept, which exists even outside ARM
and TI/OMAP scope. There are a number of more or less authoritative
sources, and you can easily find them: "Intel(R) 64 and IA-32
Architectures Optimization Reference Manual", "G5 Performance
Programming", Ulrich Drepper's "What Every Programmer Should Know
About Memory", etc.

Basically it's the thing, which eliminates or greatly reduces the need
for doing software prefetching in the code with easily predictable
memory access patterns (a simple sequential access for example). Such
as this STREAM benchmark.

Måns Rullgård

May 29, 2011, 12:32:48 PM
to panda...@googlegroups.com
Siarhei Siamashka <siarhei....@gmail.com> writes:

Had you spent a little more time reading and less being condescending
you might have noticed, in the part of my message you cut, that the A9
does have automatic prefetching. It is currently disabled, by mistake,
on the OMAP4 GP devices. On OMAP4 HS, and presumably other A9-based
chips, it is available.

--
Måns Rullgård
ma...@mansr.com

Siarhei Siamashka

May 29, 2011, 1:37:29 PM
to panda...@googlegroups.com
2011/5/29 Måns Rullgård <ma...@mansr.com>:

> Siarhei Siamashka <siarhei....@gmail.com> writes:
>
>> 2011/5/27 Måns Rullgård <ma...@mansr.com>:
>>> Siarhei Siamashka <siarhei....@gmail.com> writes:
>>>
>>>> So until proven otherwise, we can safely assume that none of the ARM
>>>> Cortex-A9 based systems has a usable hardware prefetcher at the
>>>> moment.
>>>
>>> What do you mean by hardware prefetcher?
>>
>> I mean the generally accepted concept, which exists even outside ARM
>> and TI/OMAP scope. There are a number of more or less authoritative
>> sources, and you can easily find them: "Intel(R) 64 and IA-32
>> Architectures Optimization Reference Manual", "G5 Performance
>> Programming", Ulrich Drepper's "What Every Programmer Should Know
>> About Memory", etc.
>>
>> Basically it's the thing, which eliminates or greatly reduces the need
>> for doing software prefetching in the code with easily predictable
>> memory access patterns (a simple sequential access for example). Such
>> as this STREAM benchmark.
>
> Had you spent a little more time reading and less being condescending
> you might have noticed, in the part of my message you cut, that the A9
> does have automatic prefetching.

Thank you, KO. Had you spent a little more time reading what I have
written in my message, you might have noticed the emphasis on "usable"
and "at the moment". The information from Li Yi with the benchmark
data from Tegra2 was actually useful, and provides a hint that
hardware prefetch is also not enabled there "by mistake" or for
whatever other reason. And a part of my older message that you also
decided to cut and avoid commenting, contained the link to some
benchmarks which indicated that iPad 2 (presumably ARM Cortex-A9
based) does not have perfect memory performance either.

Måns Rullgård

May 29, 2011, 2:26:08 PM
to panda...@googlegroups.com
Siarhei Siamashka <siarhei....@gmail.com> writes:

Maybe the kernel is simply not enabling it. It is off on reset. It
appears that you have read neither the manuals nor the kernel source,
just making bold statements in a rude tone as always.

> And a part of my older message that you also decided to cut and avoid
> commenting, contained the link to some benchmarks which indicated that
> iPad 2 (presumably ARM Cortex-A9 based) does not have perfect memory
> performance either.

That is the least of the problems with Apple products.

--
Måns Rullgård
ma...@mansr.com

Siarhei Siamashka

May 29, 2011, 3:28:47 PM
to panda...@googlegroups.com

Maybe. I would be very interested to see somebody try it and confirm
that it is actually working by posting relevant benchmark results.

> It appears that you have read neither the manuals nor the kernel source,
> just making bold statements in a rude tone as always.

My "bold" statement basically boils down to "seeing is believing".
What's wrong about it?

Stehle, Vincent

May 30, 2011, 3:34:55 AM
to panda...@googlegroups.com
On Fri, May 27, 2011 at 8:53 PM, Binwei Yang <binw...@gmail.com> wrote:
> From the CTT tool, i see EMIF run at 400MHz and DDRPHY run at 200MHz.

Hi,

Are you sure about the readings? You should rather have EMIF clk (200 MHz) = DDR PHY clk (400 MHz) / 2.

Best regards,

V.

Harish Mahendrakar

Jun 3, 2011, 2:53:15 AM
to pandaboard
I printed the Auxiliary Control Register
(http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Cjaibafe.html)
value for the Pandaboard and I am seeing that both L1 prefetch and L2
prefetch are disabled. The value read was 0x41.
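
For readers who want to check such a value by hand, here is a small decoding sketch. The bit positions are the ones I understand the Cortex-A9 TRM to list for the Auxiliary Control Register (FW at bit 0, L2 prefetch hint at bit 1, L1 prefetch at bit 2, SMP at bit 6, and so on); treat them as an assumption and verify them against the TRM revision that matches your core. The short field names are made up for this example.

```python
# Assumed Cortex-A9 ACTLR bit positions; verify against the TRM.
ACTLR_BITS = {
    0: "FW",                # cache/TLB maintenance broadcast
    1: "L2_prefetch_hint",
    2: "L1_prefetch",
    3: "write_full_line_of_zeros",
    6: "SMP",
    7: "EXCL",
    8: "alloc_in_one_way",
    9: "parity_on",
}


def decode_actlr(value):
    """Return a dict mapping each known field name to True/False."""
    return {name: bool((value >> bit) & 1) for bit, name in ACTLR_BITS.items()}


for name, enabled in decode_actlr(0x41).items():
    print("%-26s %s" % (name, "on" if enabled else "off"))
```

With 0x41 only bits 0 and 6 are set, so under these assumptions FW and SMP are on and both prefetch bits are off.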
I tried to enable L1 prefetch, but I am seeing an undefined
instruction error. The TRM says CP15SDISABLE should be low to write to the
Auxiliary Control Register. Any clues on how to set CP15SDISABLE
low?
On omap3 I was able to update the Auxiliary Control Register in the bootloader,
in secureworld_exit() in arch/arm/cpu/armv7/omap3/board.c
(http://git.linaro.org/gitweb?p=boot/u-boot-linaro-stable.git;a=blob;f=arch/arm/cpu/armv7/omap3/board.c;h=6c2a132b63bf2147b4acde3f9877ab56270f54e8;hb=HEAD).
But for omap4 I do not see such a function in arch/arm/cpu/armv7/omap4/board.c.
I tried calling secureworld_exit() from s_init, but the board
hangs while loading u-boot.bin.
Where in the bootloader can I add code to enable L1 prefetch?
Regards,
Harish Mahendrakar


Woodruff, Richard

Jun 3, 2011, 8:34:11 PM
to panda...@googlegroups.com

>Maybe. I would be very interested to see somebody try it and confirm
>that it is actually working by posting relevant benchmark results.
>
>> It appears that you have read neither the manuals nor the kernel source,
>> just making bold statements in a rude tone as always.

There are prefetchers at the a9mpcore and PL310 level. If you turn them on (ES2.2 or 2.3 GP or any EMU device with PPA patch) you will see benchmark improvements. If you just look at memcpy they are positive. There is much more to workloads than >L2 size memcpy's so YMWV.

There are errata scattered around on both. And what you can safely use does depend on which A9 and PL310 rev you are using.

The 4460 does have newer versions of both. The newer PL310's also offer more knobs (like double line fill). Samples do show a good improvement as expected. Probably it will be a while before others can get at them to verify.

The experiments on GP2.1 Panda are limited due to security blocking registers. Other phones or tablets should be around soon for expanded trials. Playbook is out there with a newer revision but that is not Linux.

Regards,
Richard W.

Måns Rullgård

Jun 4, 2011, 6:36:03 AM
to panda...@googlegroups.com
"Woodruff, Richard" <r-woo...@ti.com> writes:

>>Maybe. I would be very interested to see somebody try it and confirm
>>that it is actually working by posting relevant benchmark results.
>>
>>> It appears that you have read neither the manuals nor the kernel source,
>>> just making bold statements in a rude tone as always.
>
> There are prefetchers at a9mpcore and PL310 level. If you turn them
> on (ES2.2 or 2.3 GP or any EMU device with PPA patch) you will see
> benchmark improvements. If you just look at memcpy they are positive.
> There is much more to work loads then >L2 size memcpy's so YMWV.

Is the ROM fix enabling full control of the A9 aux control and PL310
registers in ES2.2? Someone said it wouldn't be until ES2.3, and I've
heard nothing authoritative.

--
Måns Rullgård
ma...@mansr.com

Woodruff, Richard

Jun 4, 2011, 9:39:49 AM
to panda...@googlegroups.com
>Is the ROM fix enabling full control of the A9 aux control and PL310
>registers in ES2.2? Someone said it wouldn't be until ES2.3, and I've
>heard nothing authoritative.

It should be in 2.2GP based on old plan. I didn't check post release. I've been using HS for a bit so didn't try it on GP first hand.

PL310 Aux: R12=109, R0=value
PL310 POR: R12=113, R0=value

The 2.3 fix list I know of doesn't list these. So I assume it made 2.2.

Regards,
Richard W.

Måns Rullgård

Jun 4, 2011, 10:09:39 AM
to panda...@googlegroups.com
"Woodruff, Richard" <r-woo...@ti.com> writes:

FWIW, the latest TRM revision says it's available from 2.2.

--
Måns Rullgård
ma...@mansr.com
