EMIF performance counter for bandwidth measurement

JW

Mar 12, 2011, 11:03:03 PM

to pandaboard

Hi,

I am working on the OMAP4430 (PandaBoard) and trying to use the EMIF
performance counters to measure the DDR bandwidth. The registers I
used are EMIF_PERF_CNT_1, EMIF_PERF_CNT_2, and EMIF_PERF_CNT_TIM.
Based on the TRM, there appear to be two ways (described below) to
measure the bandwidth, but they DO NOT agree with each other. Can you
please help explain this? The two ways I can think of are as follows.
In addition, what is the relationship between the EMIF clock,
EMIF_L3_ICLK, EMIF_FCLK, and DDR_CLK?

1. Read the total read and total write command counts (CNTR_CFG=2 and
3 respectively), and multiply them by 8 (burst length) and then by 4
(32-bit DDR data bus).

2. Read the number of EMIF clock cycles spent transferring data
(CNTR_CFG=10, or 0x0A), and multiply it by 16 (because of the 128-bit
L3 system bus).
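
As a rough sketch of the two estimates above (Python; the function
names and the sample counter values are hypothetical and only
illustrate the arithmetic, they are not measurements):

```python
# Factors taken from the two methods described above.
BURST_LENGTH = 8           # assumed burst length per command (method 1)
DDR_BUS_BYTES = 4          # 32-bit DDR data bus
BYTES_PER_DATA_CYCLE = 16  # factor applied per EMIF data cycle (method 2)


def bytes_from_commands(read_cmds, write_cmds):
    """Method 1: READ/WRITE command counts (CNTR_CFG=2 and 3)."""
    return (read_cmds + write_cmds) * BURST_LENGTH * DDR_BUS_BYTES


def bytes_from_data_cycles(data_cycles):
    """Method 2: EMIF cycles spent transferring data (CNTR_CFG=10)."""
    return data_cycles * BYTES_PER_DATA_CYCLE


# Made-up counter values, for illustration only.
method1 = bytes_from_commands(read_cmds=1000, write_cmds=500)
method2 = bytes_from_data_cycles(data_cycles=3000)
print(method1, method2)
```

With these factors the two estimates only agree if every command
moves a full burst of 8 x 4 = 32 bytes, i.e. two data cycles per
command.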

Thanks.

prpplague

Mar 13, 2011, 8:05:45 PM

to pandaboard

JW,

Have a look at the example test programs located here:

http://www.elinux.org/Board_Bringup_Utilities

Dave

Måns Rullgård

Mar 13, 2011, 8:35:09 PM

to panda...@googlegroups.com

JW <hous...@gmail.com> writes:

> Hi,
>
> I am working on the OMAP4430 (PandaBoard) and trying to use the EMIF
> performance counters to measure the DDR bandwidth. The registers I
> used are EMIF_PERF_CNT_1, EMIF_PERF_CNT_2, and EMIF_PERF_CNT_TIM.
> Based on the TRM, there appear to be two ways (described below) to
> measure the bandwidth, but they DO NOT agree with each other. Can you
> please help explain this? The two ways I can think of are as follows.
> In addition, what is the relationship between the EMIF clock,
> EMIF_L3_ICLK, EMIF_FCLK, and DDR_CLK?

EMIF_FCLK is half of DDR_CLK. EMIF_L3_ICLK is not (necessarily)
synchronous with these.

> 1. Read the total read and total write command counts (CNTR_CFG=2 and
> 3 respectively), and multiply them by 8 (burst length) and then by 4
> (32-bit DDR data bus).

Is the full burst length always used? I'd hope so, for normal cases,
but the memory performance seen from the A9 is so poor that I have my
doubts.

> 2. Read the number of EMIF clock cycles spent transferring data
> (CNTR_CFG=10, or 0x0A), and multiply it by 16 (because of the 128-bit
> L3 system bus).

The TRM says this about that setting:

Count number of EMIF clock cycles for which the memory data bus was
transferring data.

The way I read this, it counts EMIF cycles with data on the DDR bus, in
other words the counter increments on every second DDR bus cycle with
data. This seems to agree with measurements as well.
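
A minimal sketch of that reading, in Python. It assumes two DDR clock
cycles per EMIF cycle (EMIF_FCLK at half of DDR_CLK), two transfers
per DDR clock (double data rate) and a 32-bit data bus. The function
names, the 200 MHz EMIF_FCLK and the cycle counts are illustrative
assumptions, and the elapsed cycle count is assumed to come from a
free-running EMIF cycle counter such as EMIF_PERF_CNT_TIM:

```python
def bytes_per_emif_cycle(ddr_clks_per_emif_clk=2,
                         transfers_per_ddr_clk=2,
                         bus_bytes=4):
    """Bytes moved on the DDR bus during one EMIF cycle with data."""
    return ddr_clks_per_emif_clk * transfers_per_ddr_clk * bus_bytes


def bus_utilization(data_cycles, elapsed_cycles):
    """Fraction of elapsed EMIF cycles that carried data."""
    return data_cycles / elapsed_cycles


def bandwidth_mb_per_s(data_cycles, elapsed_cycles, emif_fclk_hz):
    """Average bandwidth over the measurement window, in MB/s."""
    seconds = elapsed_cycles / emif_fclk_hz
    return data_cycles * bytes_per_emif_cycle() / seconds / 1e6


# Made-up sample: one second at an assumed 200 MHz EMIF_FCLK.
print(bytes_per_emif_cycle())
print(bus_utilization(50_000_000, 200_000_000))
print(bandwidth_mb_per_s(50_000_000, 200_000_000, 200_000_000))
```

Under these assumptions one counted cycle corresponds to 16 bytes on
the DDR bus, which is the same factor as in method 2, although the
justification is the DDR bus width and data rate rather than the
128-bit L3 bus.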

--
Måns Rullgård
ma...@mansr.com

Binwei Yang

Mar 13, 2011, 8:48:34 PM

to panda...@googlegroups.com, Måns Rullgård

"Is the full burst length always used? I'd hope so, for normal cases,
but the memory performance seen from the A9 is so poor that I have my
doubts. "

I'm also interested in this. The memory BW we measured is very, very poor compared to Tegra2. Any clue? Is there a misconfiguration here?

thanks

Binwei

Rob Clark

Mar 13, 2011, 8:53:51 PM

to panda...@googlegroups.com, Binwei Yang, Måns Rullgård

2011/3/13 Binwei Yang <binw...@gmail.com>:

>
> "Is the full burst length always used? I'd hope so, for normal cases,
> but the memory performance seen from the A9 is so poor that I have my
> doubts. "
> I'm also interested in this. The memory BW we measured is very very poor
> compared to Tegra2. Any clue? Is there misconfiguration here?

Yes, sort of, although proper configuration is not really possible
(missing ROM code API) on ES2.1 GP devices (i.e. all A1 Pandas).

BR,
-R

JW

Mar 14, 2011, 2:29:34 PM

to pandaboard

Hi Dave,

Thank you. Actually, my test code is based on the bring-up utilities
you pointed to. The question is how to interpret the results. :)

JW

JW

Mar 14, 2011, 2:38:04 PM

to pandaboard

Hi Mans,

As for the burst length, the OMAP4430 TRM does not have a good
description. But I checked the TRM of the OMAP-L1x, which appears to
have the same EMIF block and performance counter filter. There, the
READ/WRITE command counter is described as:

Counts the total number of READ commands (read accesses)
the EMIFB receives.
Counter increments for transfers aligned to the default burst size
(DBS) are equal to the transfer size divided by the DBS.

The default burst size is 8, so I assume we can use this counter to
measure the READ or WRITE bytes, as in method #1 from my first post.
However, the results differ from those of method #2 (EMIF clock
cycles of data transfer). When I transferred a known number of bytes,
the result of method #2 matched. If method #2 is to be trusted, then
we have to conclude that the burst size is not always 8 and that
there is no way to measure the READ and WRITE bandwidth separately.
What do you think?
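
One way to check this would be to derive the average burst length
from a transfer of known size. A minimal Python sketch follows; the
function name and the sample numbers are hypothetical, not measured
values:

```python
DDR_BUS_BYTES = 4  # 32-bit DDR data bus


def effective_burst_length(bytes_moved, commands):
    """Average number of 32-bit beats per READ/WRITE command.

    bytes_moved: known transfer size (or the method #2 estimate)
    commands:    READ + WRITE command count from the counters
    """
    if commands <= 0:
        raise ValueError("command count must be positive")
    return bytes_moved / (commands * DDR_BUS_BYTES)


# Made-up example: 1 MiB moved, 40000 commands counted.
print(effective_burst_length(1 << 20, 40000))
```

A result below 8 would suggest that some commands use less than the
full burst, in which case multiplying the command counts by 8
overestimates the traffic.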

JW

Antti P Miettinen

Mar 24, 2011, 4:07:31 AM

to panda...@googlegroups.com

Rob Clark <robd...@gmail.com> writes:
> 2011/3/13 Binwei Yang <binw...@gmail.com>:
>>
>> "Is the full burst length always used? I'd hope so, for normal cases,
>> but the memory performance seen from the A9 is so poor that I have my
>> doubts. "
>> I'm also interested in this. The memory BW we measured is very very poor
>> compared to Tegra2. Any clue? Is there misconfiguration here?
>
> yes, sort of.. although proper configuration is not really possible
> (missing ROM code API) on es2.1 gp devices (ie. all A1 pandas)

Can you elaborate? A pointer to errata? For example, lmbench does indeed
give some quite poor memory performance results. I was sort of looking
forward to seeing how wonderful the new memory subsystem would be compared
to OMAP3.

--
http://www.iki.fi/~ananaza/

Binwei Yang

Apr 8, 2011, 4:51:27 AM

to panda...@googlegroups.com, Antti P Miettinen

Core 0 alone can't drive the full memory BW. E.g. if we use only core 0, we get 400MB/s; if we use core 0 and core 1, we get 600MB/s. Please note that these are not the real results.
It also shows much higher memory latency than Tegra2, almost double.

If there is no way to fix this, the memory system will be a big bottleneck for the OMAP4430. Doesn't TI know this?

Binwei Yang

Apr 29, 2011, 2:19:46 AM

to panda...@googlegroups.com

I think Rob means the address filter on the PL310. The address filter in the SCU can be accessed, but the one on the PL310 cannot.

According to the TRM, there is a local interconnect between the PL310 and the L3/memory controller. I'm not sure whether it routes memory requests to the memory controllers. If it does, then there is no way to optimize this. The key is how the PL310's two master ports connect to the local interconnect and are routed to the L3/memory controller.

Binwei Yang

Apr 29, 2011, 2:27:04 AM

to panda...@googlegroups.com

OK, from the TRM: two 64-bit master ports, one to L3 and one to EMIF.

If the diagram is correct that master 0 connects to EMIF and master 1 connects to L3, we have no way to configure the PL310's address filter register, because it only gives a way to route a range of memory to master 1, but OMAP4's memory addresses don't start from 0.

Stehle, Vincent

May 2, 2011, 6:07:52 AM

to panda...@googlegroups.com

On Fri, Apr 29, 2011 at 8:27 AM, Binwei Yang <binw...@gmail.com> wrote:

> OK, from the TRM: two 64-bit master ports, one to L3 and one to EMIF.
>
> If the diagram is correct that master 0 connects to EMIF and master 1 connects to L3, we have no way to configure the PL310's address filter register, because it only gives a way to route a range of memory to master 1, but OMAP4's memory addresses don't start from 0.

Hi,

FYI, the connection between PL310 outputs and EMIF/L3 is not direct. Rather, this goes through an AXI2OCP module:

     ___________      _______      _________
    |           |--->|       |--->|         |---> (to DMM/EMIF)
    | Cortex-A9 |    | PL310 |    | AXI2OCP |---> (to L3)
    |           |--->|       |--->|         |---> (to ABE)
     -----------      -------      ---------

Best regards,

V.
