Question on cycle-based counter inside OMAP/ARM


Samuel

Mar 23, 2010, 5:37:27 AM
to Beagle Board
On x86, there is a Time Stamp Counter (TSC) which can be read from user
mode to get the elapsed cycles easily. The TSC simply increases by
one on every cycle.

Is there a similar cycle counter inside OMAP or ARM? It would be very
useful for monitoring the performance of my application without needing OProfile.

Thanks!

Samuel

Michael Zucchi

Mar 23, 2010, 6:18:42 AM
to beagl...@googlegroups.com
There are the performance counters - I think you can enable user-land
access to them via the kernel, but I don't know how. I have only used them
without Linux, so I don't know how far its support goes (e.g. for
context switches).

There are 4 of them, and they can be programmed to count all manner of
quite interesting things, from cache misses to predicted branches to
the simple passage of time or instructions. There is also a simple
cycle counter which just counts up, and can be pre-scaled by 64 for
longer periods.
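
A minimal sketch of how two readings of such a counter might be turned into elapsed cycles and time. It assumes a 32-bit counter and a known core clock; the helper names (`ccnt_delta`, `ticks_to_cycles`, `cycles_to_usec`) are hypothetical and not from any library.

```c
#include <stdint.h>

/* Elapsed ticks between two 32-bit counter readings. Unsigned
 * subtraction stays correct across a single wraparound. */
static inline uint32_t ccnt_delta(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert ticks to CPU cycles. With the divide-by-64 prescaler
 * enabled, each tick stands for 64 cycles. */
static inline uint64_t ticks_to_cycles(uint32_t ticks, int div64_enabled)
{
    return (uint64_t)ticks * (div64_enabled ? 64u : 1u);
}

/* Convert cycles to microseconds for a given core clock in Hz
 * (the clock rate is an input you must supply for your board). */
static inline double cycles_to_usec(uint64_t cycles, double cpu_hz)
{
    return (double)cycles * 1e6 / cpu_hz;
}
```

At an assumed 600 MHz clock, a 32-bit counter wraps after roughly 7 seconds without the prescaler, which is why the divide-by-64 mode helps for longer measurements.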

See DDI0406B: "ARM Architecture Reference Manual, ARMv7-A and ARMv7-R
edition", Chapter C9 - Performance Monitors.

!Z


Samuel

Mar 23, 2010, 6:57:20 AM
to Beagle Board
The performance counters seem too heavy for most uses; they are meant for
detailed performance tuning.
If there is a counter similar to x86's TSC (it increases by 1 on every
machine cycle, like a watch's tick, and is readable globally), it would be
very convenient, since the user only needs to read the TSC register
directly. The difference between two TSC values is simply the number of
cycles elapsed between the two readings.

Any more information?

Samuel


Michael Zucchi

Mar 23, 2010, 7:04:12 AM
to beagl...@googlegroups.com
On 23 March 2010 21:27, Samuel <samuel....@gmail.com> wrote:
> The performance counters seem too heavy for most uses; they are meant for
> detailed performance tuning.

Err? It is? If you don't need detailed performance tuning use gettimeofday().

> If there is a counter similar to x86's TSC (it increases by 1 on every
> machine cycle, like a watch's tick, and is readable globally), it would be
> very convenient, since the user only needs to read the TSC register
> directly. The difference between two TSC values is simply the number of
> cycles elapsed between the two readings.

Well, like I said in my reply, there is a simple cycle counter too.

> Any more information?

Try reading the manuals and using google.

Måns Rullgård

Mar 23, 2010, 7:26:32 AM
to beagl...@googlegroups.com
Michael Zucchi <not...@gmail.com> writes:

> On 23 March 2010 21:27, Samuel <samuel....@gmail.com> wrote:
>> The performance counters seem too heavy for most uses; they are meant for
>> detailed performance tuning.
>
> Err? It is?

Of course not. The performance counters have no overhead at all.

> If you don't need detailed performance tuning use gettimeofday().

gettimeofday() is much more expensive. Reading the Cortex-A8 cycle
counters with an MRC instruction takes 50 cycles while a call to
gettimeofday() takes about 1000 cycles.
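
A rough sketch for checking the cost of gettimeofday() on your own board. It assumes a POSIX system with clock_gettime() and CLOCK_MONOTONIC; the function name `avg_gettimeofday_ns` is made up for illustration, and the result is in nanoseconds, not cycles.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* ask for POSIX.1-2008 + XSI declarations */
#endif
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Average wall-clock cost of one gettimeofday() call, in nanoseconds.
 * Returns 0.0 when iterations is not positive. */
static double avg_gettimeofday_ns(int iterations)
{
    struct timespec t0, t1;
    struct timeval tv;
    double total_ns;
    int i;

    if (iterations <= 0)
        return 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++)
        gettimeofday(&tv, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    total_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
             + (double)(t1.tv_nsec - t0.tv_nsec);
    return total_ns / iterations;
}
```

Multiplying the result by the core clock in GHz gives an approximate cycle count per call, which can then be compared with the figures quoted above.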

>> If there is a counter similar to x86's TSC (it increases by 1 on every
>> machine cycle, like a watch's tick, and is readable globally), it would be
>> very convenient, since the user only needs to read the TSC register
>> directly. The difference between two TSC values is simply the number of
>> cycles elapsed between the two readings.
>
> Well, like I said in my reply, there is a simple cycle counter too.
>
>> Any more information?
>
> Try reading the manuals and using google.

The ARM ARM and Cortex-A8 TRM would be good reading.

Here is a patch to make the counters available directly from userspace:
http://git.mansr.com/?p=linux-omap;a=commitdiff;h=5170038

--
Måns Rullgård
ma...@mansr.com

Samuel

Mar 23, 2010, 8:05:01 AM
to Beagle Board
Thanks, Måns Rullgård and Michael Zucchi!
Yes, gettimeofday() needs many more cycles, and its granularity might be
too coarse.
Could you tell me more about how to use the MRC instruction to read a
cycle counter? E.g. which counter, and how? A 50-cycle overhead looks OK
to me.

BTW, I must clarify that by "heavy" I meant that the usage is not very
easy: since I must communicate with the PMU via a kernel module, it
needs more code and debugging than reading a register directly from
user space. I know the overhead of the PMU itself is light. :)

I would prefer a counter that does not live inside the PMU, but if there
is no choice besides the PMU, I will try the PMU cycle counter. For the
patch enabling user-space access to the PMU, do I need to recompile the
kernel, or is it already upstream? Is there any usage example?

Thanks again!

Samuel



Samuel

Mar 23, 2010, 8:21:58 AM
to Beagle Board
I also found some discussion at http://blog.gmane.org/gmane.linux.ports.arm.general/month=20080801
about reading the CCNT of the PXA processor.
It seems the PMU's CCNT is a nice choice, and the only one.
I would appreciate any guidance on reading CCNT based on your patch.

BTW, if I cannot rebuild the kernel because I lack the customized
kernel source, is it possible to write a driver (.ko) to
read CCNT, and how?

Samuel

Gabi Voiculescu

Mar 23, 2010, 8:37:32 AM
to beagl...@googlegroups.com
Hi.

Yes, you can use the fixed-function cycle counter CCNT in your OMAP application.

You could also program one of the programmable counters to count cycles, so that you have multiple hardware sources, but in my experience whether that works depends on the OMAP silicon (http://e2e.ti.com/support/arm174_microprocessors/omap_applications_processors/f/42/p/38720/135485.aspx#135485).

Documentation:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344j/DDI0344J_cortex_a8_r3p2_trm.pdf

What you want is around page 210 (the MRC/MCR instructions involved).

You basically need the following:
- program bit 0 in the USEREN register from the kernel to enable userland access to the counters
- program PMNC bit 0 to enable the counter hardware
- program CNTENS bit 31 to enable CCNT counting
- optionally, program CNTENC bit 31 to disable CCNT counting
- unmask iPMU_IRQ on your OMAP
- read CCNT to get the current 32-bit hardware value of the counter
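
The register steps above could be sketched roughly as follows. This is an illustration under assumptions, not tested driver code: the macro and function names are hypothetical, the CP15 encodings are the ones given in the Cortex-A8 TRM, the USEREN write only works in privileged (kernel) mode, and the iPMU_IRQ step is not shown. The non-ARM branch exists only so the sketch compiles on a host machine.

```c
#include <stdint.h>

/* Bit positions taken from the steps listed above. */
#define USEREN_EN   (1u << 0)   /* USEREN bit 0: allow user-mode access */
#define PMNC_E      (1u << 0)   /* PMNC bit 0: enable the counter hardware */
#define CNTEN_CCNT  (1u << 31)  /* CNTENS/CNTENC bit 31: the cycle counter */

#if defined(__arm__)
/* Write one CP15 c9 performance-monitor register. */
#define PMU_WRITE(crm, op2, val) \
    __asm__ volatile ("mcr p15, 0, %0, c9, " #crm ", " #op2 \
                      :: "r"((uint32_t)(val)))
#else
/* Non-ARM host: there are no such registers, so this is a no-op. */
#define PMU_WRITE(crm, op2, val) ((void)(val))
#endif

/* Kernel side: must run in privileged mode, e.g. from a module's
 * init function. */
static inline void pmu_allow_user_access(void)
{
    PMU_WRITE(c14, 0, USEREN_EN);   /* USEREN */
}

/* Enable the counter hardware, then start CCNT counting. */
static inline void pmu_start_ccnt(void)
{
    PMU_WRITE(c12, 0, PMNC_E);      /* PMNC */
    PMU_WRITE(c12, 1, CNTEN_CCNT);  /* CNTENS */
}

/* Stop CCNT counting. */
static inline void pmu_stop_ccnt(void)
{
    PMU_WRITE(c12, 2, CNTEN_CCNT);  /* CNTENC */
}
```

Once user access is enabled from the kernel, the start and stop helpers and the CCNT read itself can run in user space.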

Gabi Voiculescu


Måns Rullgård

Mar 23, 2010, 9:20:52 AM
to beagl...@googlegroups.com
Samuel <samuel....@gmail.com> writes:


> Thanks, Måns Rullgård and Michael Zucchi!
> Yes, gettimeofday() needs many more cycles, and its granularity might be
> too coarse.
> Could you tell me more about how to use the MRC instruction to read a
> cycle counter? E.g. which counter, and how? A 50-cycle overhead looks OK
> to me.
>
> BTW, I must clarify that by "heavy" I meant that the usage is not very
> easy: since I must communicate with the PMU via a kernel module, it
> needs more code and debugging than reading a register directly from
> user space. I know the overhead of the PMU itself is light. :)
>
> I would prefer a counter that does not live inside the PMU, but if there
> is no choice besides the PMU, I will try the PMU cycle counter. For the
> patch enabling user-space access to the PMU, do I need to recompile the
> kernel, or is it already upstream? Is there any usage example?

To use the cycle counter from userspace, apply the patch linked above,
enable the new config option and rebuild the kernel. In your app, use
these functions to access the counter:

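/* Enable the cycle counter: set bit 31 of CNTENS (count enable set). */
static inline void ccnt_start(void)
{
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31));
}

/* Disable the cycle counter: set bit 31 of CNTENC (count enable clear). */
static inline void ccnt_stop(void)
{
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 2" :: "r"(1u << 31));
}

/* Read the current 32-bit value of CCNT. */
static inline unsigned ccnt_read(void)
{
    unsigned cc;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r"(cc));
    return cc;
}

/* Stop the counter, then write PMNC: bit 0 enables the counters,
 * bit 2 resets the cycle counter to zero. */
static inline void ccnt_init(void)
{
    ccnt_stop();
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r"(5));
}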

If you don't stop the counter after you're done with it, oprofile will
be unhappy, should you wish to use it. Needless to say, using this
while oprofile is running is not a good idea.
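
As a usage illustration, here is a minimal sketch of timing one function call with a counter-reading function such as ccnt_read() above. The helper `measure_ticks` and the `fake_*` stand-ins are hypothetical names added for this example; the stand-ins only exist so the helper can be tried on a machine without CCNT.

```c
/* Signature matches ccnt_read() from the functions above. */
typedef unsigned (*counter_fn)(void);

/* Time one call of fn(arg) in counter ticks, subtracting the cost of
 * a counter read as measured by two back-to-back reads. */
static unsigned measure_ticks(counter_fn read_counter,
                              void (*fn)(void *), void *arg)
{
    unsigned t0 = read_counter();
    unsigned t1 = read_counter();
    unsigned overhead = t1 - t0;

    unsigned start = read_counter();
    fn(arg);
    unsigned end = read_counter();

    /* Unsigned subtraction stays correct across one wraparound. */
    unsigned elapsed = end - start;
    return elapsed > overhead ? elapsed - overhead : 0;
}

/* Hypothetical stand-ins: a counter that advances 10 ticks per read
 * and a workload that "costs" 100 ticks. */
static unsigned fake_ticks;
static unsigned fake_counter(void) { fake_ticks += 10; return fake_ticks; }
static void fake_work(void *arg) { (void)arg; fake_ticks += 100; }
```

On the board, the idea would be to call ccnt_init() and ccnt_start() once, pass ccnt_read as the first argument with your own function as the second, and call ccnt_stop() when done.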

--
Måns Rullgård
ma...@mansr.com
