The realistic Cortex-A7 clock frequency limit for Allwinner A20 (CubieBoard2)?

Siarhei Siamashka

Jun 22, 2013, 4:33:06 PM
to linux...@googlegroups.com
Hello,

Currently, CubieBoard2 ships with the CPU clock frequency limit
set to 912MHz and the core voltage at 1.4V.

I tried to run some tests with different CPU clock frequencies and
voltages. A very useful test program is 7-zip (packaged as p7zip
in Debian/Ubuntu). It can be run as "7za b -mmt=4" to stress
multi-threaded LZMA compression/decompression. It is primarily
a benchmark, but it is also able to detect system instability:
if the decompressed result does not match the original data, an
error message will be shown.
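
For reference, a trivial wrapper that keeps looping the benchmark and stops
at the first sign of trouble could look roughly like the sketch below. The
exact error text printed by 7-Zip is not something to rely on, so the sketch
just treats any output line containing "Error" (or a non-zero exit status)
as a failure; both checks are assumptions rather than documented behaviour.

/* Hypothetical sketch: loop "7za b -mmt=4" until a failure is detected.
 * Assumption: a corrupted result shows up either as a non-zero exit
 * status or as an output line containing the word "Error". */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[1024];
    int pass = 1;

    for (;; pass++) {
        FILE *p = popen("7za b -mmt=4 2>&1", "r");
        int bad = 0;

        if (!p) {
            perror("popen");
            return 1;
        }
        while (fgets(line, sizeof(line), p)) {
            if (strstr(line, "Error"))
                bad = 1;
        }
        if (pclose(p) != 0)
            bad = 1;
        if (bad) {
            fprintf(stderr, "failure detected on pass %d\n", pass);
            return 1;
        }
        printf("pass %d OK\n", pass);
    }
}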

For the cpuburn test I used the following program:
https://github.com/ssvb/cpuburn/blob/e96f9c7d56b2/cpuburn-a7.S

Below are the power consumption numbers, using a multimeter to
measure the current draw from the 5V barrel power connector. First
is the current default setup:

=== CubieBoard2 default, CPU clock frequency 912MHz (VDD=1.400V) ===

idle: ~250mA
7zip benchmark: ~450mA
cpuburn: ~680mA
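
Just for scale: the board is fed from 5V, so the measured current converts
to power as P = V * I (this is the whole-board draw, regulator losses
included, not the SoC alone). A tiny sketch with the numbers above:

/* Sketch: convert the measured 5V input current into watts (P = V * I).
 * These are whole-board figures, with regulator losses included. */
#include <stdio.h>

int main(void)
{
    const double volts = 5.0;
    const double amps[] = { 0.25, 0.45, 0.68 };   /* idle, 7zip, cpuburn */
    const char *name[] = { "idle", "7zip benchmark", "cpuburn" };
    int i;

    for (i = 0; i < 3; i++)
        printf("%-15s ~%.2f W\n", name[i], volts * amps[i]);
    return 0;
}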

Next is the configuration hinted at in http://tinyurl.com/n279mdk
(https://github.com/cubieboard2/linux-sunxi) via the following
defines:

#define CPU_EXTREMITY_FREQ 1008000 /* cpu extremity frequency: 1008M */
#define SYS_VDD_EXTREMITY 1300000 /* sys vdd: 1.3V */

=== CubieBoard2, CPU clock frequency 1GHz (VDD=1.300V) ===

idle: ~250mA
7zip benchmark: ~430mA
cpuburn: ~640mA


And finally, here is the list of configurations with the minimal voltage
for each clock frequency (tested in 0.025V steps) that still allows the
7zip benchmark to run without failures or deadlocks (a sketch of this
search loop follows the results below):

=== CubieBoard2, CPU clock frequency 912MHz (VDD=1.200V) ===

idle: ~240mA
7zip benchmark: ~380mA
cpuburn: ~540mA

=== CubieBoard2, CPU clock frequency 1GHz (VDD=1.275V) ===

idle: ~250mA
7zip benchmark: ~420mA
cpuburn: ~620mA

=== CubieBoard2, CPU clock frequency 1.1GHz (VDD=1.375V) ===

idle: ~260mA
7zip benchmark: ~500mA
cpuburn: ~750mA

For the last configuration I only tried to run cpuburn for a few seconds
before stopping it, and additionally installed a small heatsink on top of
the Allwinner A20 chip.
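
For the record, the per-frequency minimum voltage search described above
boils down to the loop sketched below. The set_cpu_freq_khz(), set_vdd_mv()
and run_stress_test() helpers are hypothetical placeholders (just stubs here)
for whatever mechanism is actually used to apply the settings and run the
7zip/cpuburn workload, so treat this as an illustration of the procedure
rather than a ready-to-use tool:

/* Sketch of the per-frequency minimum voltage search (0.025V steps).
 * The three helpers below are hypothetical stubs: replace them with
 * whatever really applies the settings and runs the stress workload. */
#include <stdio.h>

static int set_cpu_freq_khz(int khz) { printf("freq -> %d kHz\n", khz); return 0; }
static int set_vdd_mv(int mv)        { printf("vdd  -> %d mV\n", mv);   return 0; }
static int run_stress_test(void)     { return 0; /* 0 = passed, non-zero = failed */ }

static int find_min_vdd_mv(int freq_khz, int start_mv, int step_mv, int floor_mv)
{
    int mv, last_good_mv = -1;

    set_cpu_freq_khz(freq_khz);
    for (mv = start_mv; mv >= floor_mv; mv -= step_mv) {
        set_vdd_mv(mv);
        if (run_stress_test() != 0)
            break;                  /* first failing voltage reached */
        last_good_mv = mv;
    }
    return last_good_mv;            /* lowest voltage that still passed */
}

int main(void)
{
    /* e.g. 1GHz, starting from the 1.400V default, stepping down by 25mV */
    printf("minimum passing VDD: %d mV\n",
           find_min_vdd_mv(1008000, 1400, 25, 1000));
    return 0;
}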

I wonder if the other boards show similar results? But I disclaim any
responsibility in case somebody tries to reproduce the test and burns
their board. If you are not sure, don't try this at home :)

--
Best regards,
Siarhei Siamashka

Oliver Schinagl

Jun 23, 2013, 8:14:23 AM
to linux...@googlegroups.com
So now we'd need someone who's willing to run an A20, without a heatsink,
at 1.1GHz @ 1.375V (or even 1.4V, as that's the default anyway) for a
prolonged time. Say a 24-hour full stress test.

I assume the same test could/should be done for the A10 as well.

Hans de Goede

Jun 23, 2013, 8:52:27 AM
to linux...@googlegroups.com, Oliver Schinagl
Hi,
No, tests like these are useless, unless someone goes out and buys a
sample of at least 60 A20-equipped boards over a longer period of time,
covering various PCB designs (3 per PCB design, so 20 variants), and then
runs a mixed workload on them for a prolonged period of time.

Testing 1 or 2 SoCs is useless; just like with Intel CPUs, most will be
able to work fine at higher frequencies. What we need to know is the
worst-case scenario.

Given that it is in Allwinner's best interest not to make their SoCs
run unnecessarily slowly / at too high a voltage, we can safely assume that
the settings we are getting from them are what is needed to make the
worst-case scenario work, and thus we should stick with them.

For example, when we first started adding A10s / AXP152 support to u-boot,
the author of the first patch had likely done similar testing and happily
made u-boot set a Vcore of 1.2 V (instead of the 1.4 V the Android
images are using). This worked fine on his board, and fine on mine, but
Olimex reported that *some* of their boards would not work (reliably)
unless the voltage was raised to 1.4 V.

I believe the linux-sunxi kernels don't have this, but later allwinner
kernel sources contain code to read a different voltage table for dvfs
from the fex file. This can be an interesting feature to have for people
who want to tweak things for their own board. But the stock linux-sunxi
kernels should stick to the "recommended" values.

Regards,

Hans

Iso9660

Jun 23, 2013, 10:33:10 AM
to linux...@googlegroups.com
Applying heat (with a 50W light bulb and covering the board with a blanket), monitoring the temperature, and logging the results would help determine the heat tolerance and stable clock speed of different boards.

Siarhei Siamashka

Jun 23, 2013, 12:20:54 PM
to linux...@googlegroups.com, hdeg...@redhat.com, Oliver Schinagl
It depends. Considering that we still don't have any trustworthy
official information about the supported clock frequency, running at
least some tests is good for satisfying curiosity.

A simple reproducible test can give us some preliminary information
about the variation in the clock frequency / voltage tolerance for
different devices. For example, I wonder if my chip is "good", "bad"
or "average" in this respect?

In any case, I decided to share the results of my measurements. Hope
this information was useful for some people :)

> Testing 1 or 2 socs is useless, just like with intel cpu-s most will be
> able to work fine on higher frequencies. What we need to know is the
> worst case scenario.

Yes, of course. I myself would not run the processor at the very minimum
voltage which just happened to pass a simple test. Some safety margin
is surely needed (I would probably add at least 0.1V to be on the safe
side). Or clear, unambiguous information from the vendor would sort
out the whole mess.

> Given that it is in Allwinner's best interest to not make their socs
> run unnecessary slow / at a too high voltage, we can safely assume that
> the settings we are getting from them are what is needed to make the
> worst case scenario work, and thus we should stick with them.

Which settings exactly are we getting from them? As I mentioned before,
at least one file from the Allwinner-specific cpufreq sources refers to:

#define CPU_EXTREMITY_FREQ 1008000 /* cpu extremity frequency: 1008M */
#define SYS_VDD_EXTREMITY 1300000 /* sys vdd: 1.3V */

This seems to imply that 1GHz at 1.3V is considered "extremity" by
somebody from Allwinner.

> Example, when we first started adding A10s / axp152 support to u-boot,
> the author of the first patch had likely done similar testing and happily
> made u-boot set a Vcore of 1.2 volt (instead of the 1.4 volt the android
> images are using), this worked fine on his board, and fine on mine, but
> olimex reported that *some* of their boards would not work (reliable)
> unless the voltage was raised to 1.4v.

Please have a look at A10_Datasheet.pdf

The section "6.1. Absolute Maximum Ratings" says that the absolute
maximum for VDD is 1.3V, and the section "6.2. Recommended Operating
Conditions" says that the recommended VDD is 1.2V

Yet we are currently running Allwinner A10 at 1GHz with 1.4V. Has it
turned out to be the real maximum operating voltage for A10?

For comparison, please also have a look at the "Recommended Operating
Conditions" from http://www.ti.com/lit/ds/symlink/omap3530.pdf
(a Cortex-A8 based SoC from TI). It talks about "overdrive" operating
conditions with 1.35V (instead of 1.27V) roughly halving the estimated
chip lifetime. But that's a difference of 100K power-on hours versus 44K
or 50K (more than 10 years of non-stop operation versus "just" 5 years).

Now just a random speculation. If cranking up the voltage and clock
frequency for Allwinner A10 reduced the expected lifetime to, let's
say, just 2-5 years, would Allwinner do this for the sake of providing
competitive performance? If I understand it correctly, selecting the
maximum voltage is a trade-off between performance and chip lifetime.

Getting back to Allwinner A20, we can look at the currently shipping
consumer devices. For example the Mele M5, which is currently advertised
to run at 1.2GHz. I wonder what kind of voltage settings they are using?
And will it successfully survive the cpuburn-a7 test?
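
At least the clock frequency part of that question is easy to check on any
such device through the standard cpufreq sysfs interface; the core voltage
unfortunately has no standard place to be read from, so the sketch below
only covers the frequency side:

/* Sketch: print the cpufreq settings a running device actually uses.
 * Only standard cpufreq sysfs files are read (the last one is only
 * present when the driver exposes a frequency table). */
#include <stdio.h>

static void show(const char *path)
{
    char buf[256];
    FILE *f = fopen(path, "r");

    if (!f) {
        printf("%s : not available\n", path);
        return;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("%s : %s", path, buf);
    fclose(f);
}

int main(void)
{
    show("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq");
    show("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    show("/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies");
    return 0;
}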

> I believe the linux-sunxi kernels don't have this, but later allwinner
> kernel sources contain code to read a different voltage table for dvfs
> from the fex file. This can be an interesting feature to have for people
> who want to tweak things for their own board. But the stock linux-sunxi
> kernels should stick to the "recommended" values.

I believe that's how the current A20 sources work. I mean the ones from
https://github.com/cubieboard2/linux-sunxi

John S

Jun 23, 2013, 12:44:26 PM
to linux...@googlegroups.com
>From: Siarhei Siamashka <siarhei....@gmail.com>
>Getting back to Allwinner A20, we can look at the currently shipping
>consumer devices. For example Mele M5, which is currently advertised to
>run at 1.2GHz. I wonder what kind of voltage settings are they using?
>And will it successfully survive cpuburn-a7 test?

My experience is that vendors often say 1.2GHz but actually they use the 1008MHz setting.  Over here in England that's contrary to law, but I doubt anyone will do anything, and anyway it doesn't help much with what the sunxi kernel should do.  1008 looks OK as a default.

On the 1.2V / 1.4V: it would be better if we knew we could rely on the datasheet being accurate!  Failing that, Olimex and the like are maybe the best we have and so 1.4V looks OK.  Hopefully.

Interesting results, though.

John

Siarhei Siamashka

Jun 23, 2013, 4:39:18 PM
to linux...@googlegroups.com, johns9...@yahoo.co.uk
Mele seems to be a reputable vendor. And the older Mele A1000/A2000
were correctly labelled as 1GHz. But it would be best if we had some
Mele owners with Allwinner A20 here who could share the information.

I'm worried about the clock frequency, because at 912MHz (and even
at 1GHz) the newer Allwinner A20 based devices are going to have
worse single-threaded performance than the older Allwinner A10. The
multi-threaded workloads are getting a good improvement, but still
it would be nicer if Allwinner A20 were a drop-in replacement with
preferably no regressions.

Cortex-A7 has some advantages over Cortex-A8:
- pipelined VFP (Cortex-A8 has a very slow non-pipelined VFPLite)
- the pipeline is not getting stalled on cache misses
- TLB size increased to 256 (only 32 for Cortex-A8)
- only 8 cycles branch prediction penalty (13 cycles for Cortex-A8)
- hardware division instructions
- faster integer multiplications and a speed up for a few other
originally multi-cycle instructions

However, Cortex-A7 also has some rather major performance regressions:
- NEON performance is less than half that of Cortex-A8: it can only
load/store or do arithmetic on 64 bits of data per cycle (Cortex-A8 can
handle 128 bits per cycle). Also it can't dual-issue pairs of NEON
instructions (Cortex-A8 can dual-issue load/store/permute with
arithmetic instructions). And even ARM instructions can't dual-issue
with NEON instructions (Cortex-A8 has no problems with that).
- In the integer pipeline, dual-issue opportunities are very limited:
they are restricted to branches and instructions with immediate
operands (Cortex-A8 can dual-issue practically any LSU/ALU or
ALU/ALU pair as long as there is no data dependency).

I would say that Cortex-A8 has almost twice the theoretical
peak data crunching performance if used with an ideal compiler
or perfect hand-written assembly. The peak single precision
floating point performance is also good if using the NEON unit in
Cortex-A8. But if NEON is not used for floating point calculations,
then it gets very slow with VFP. In a nutshell, Cortex-A8 is rewarding
for those who spend the effort extracting the peak performance from it.

Cortex-A7 is dumbed down, but is less challenging for optimizing
compilers. So a higher percentage of the peak performance can be
achieved automatically by the compiler without resorting to
hand-written assembly. It also deals a bit better with memory-heavy
and branch-heavy code.

In any case, Cortex-A7 still performs worse than Cortex-A8 per MHz.
So clocking Allwinner A20 higher than 1GHz would be very nice.

There is one more interesting thing. Allwinner A20 seems to only have
256K of L2 cache, which is shared between two Cortex-A7 cores. But if
it had 512K of shared L2 cache, then single-threaded workloads on
Allwinner A20 could have an L2 cache size advantage over Allwinner A10
and mitigate the weaker core performance penalty. Too bad that this
has not happened.

Vasant

Jun 24, 2013, 11:06:30 AM
to linux...@googlegroups.com, johns9...@yahoo.co.uk
The A20 page on linux-sunxi indicates that the L2 cache is 512KB. Is there any way to check on the device, since the available documentation is non-existent? I agree a 256KB L2 cache will severely limit performance.

Regards
Vasant

Siarhei Siamashka

Jun 24, 2013, 10:14:50 PM
to linux...@googlegroups.com, in...@microxwin.com, johns9...@yahoo.co.uk
On Mon, 24 Jun 2013 08:06:30 -0700 (PDT) Vasant wrote:
> On Sunday, June 23, 2013 1:39:18 PM UTC-7, Siarhei Siamashka wrote:
> >
> > There is one more interesting thing. Allwinner A20 seems to only have
> > 256K of L2 cache, which is shared between two Cortex-A7 cores. But if
> > it had 512K of shared L2 cache, then single-threaded workloads on
> > Allwinner A20 could have L2 cache size advantage over Allwinner A10
> > and mitigate the weaker core performance penalty. Too bad that this
> > has not happened.
>
> The A20 page on linux-sunxi indicates that the L2 cache is 512KB. Is there any
> way to check on the device, since the available documentation is
> non-existent? I agree a 256KB L2 cache will severely limit performance.

The L2 cache size can be measured by running various benchmarking
tools. I'm using https://github.com/ssvb/tinymembench for this purpose.
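
The basic idea behind the latency part of such tools can be sketched in a
few lines of C: fill a buffer of the chosen size with a random chain of
pointers and measure how long it takes to walk it. This is only a simplified
illustration of the principle (it is not tinymembench code, and unlike the
table below it reports absolute latency rather than the extra time over L1):

/* Simplified random-read latency probe (illustration only, not the
 * actual tinymembench code): walk a random pointer chain inside a
 * buffer of the given size and report nanoseconds per dependent load.
 * Build with: gcc -O2 -o lat lat.c (add -lrt on older glibc). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase_ns(size_t bytes, long steps)
{
    size_t n = bytes / sizeof(void *), i;
    void **buf = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));
    struct timespec t0, t1;
    void **p;
    long s;

    if (!buf || !idx) {
        fprintf(stderr, "out of memory\n");
        exit(1);
    }
    /* shuffle indices so that the chain forms one big random cycle */
    for (i = 0; i < n; i++)
        idx[i] = i;
    for (i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1), tmp = idx[i];
        idx[i] = idx[j];
        idx[j] = tmp;
    }
    for (i = 0; i < n; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % n]];

    p = &buf[idx[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (s = 0; s < steps; s++)
        p = (void **)*p;            /* dependent load: follow the chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (p == buf)                   /* keep the result observable */
        putchar(' ');
    free(idx);
    free(buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / steps;
}

int main(void)
{
    size_t kb;

    srand(0);
    for (kb = 64; kb <= 4096; kb *= 2)
        printf("%7lu KiB : %.1f ns per access\n",
               (unsigned long)kb, chase_ns(kb * 1024, 10000000L));
    return 0;
}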

When run from the current default CubieBoard2 firmware image (but
with 1GHz clock frequency enabled via cpufreq), it reports the
following:

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with total 3 requests to SDRAM for almost every ==
== memory access (though 64MiB is not large enough to experience this ==
== effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================

block size : read access time (single random read / dual random read)
2 : 0.0 ns / 0.0 ns
4 : 0.0 ns / 0.0 ns
8 : 0.0 ns / 0.0 ns
16 : 0.0 ns / 0.0 ns
32 : 0.0 ns / 0.0 ns
64 : 0.0 ns / 0.0 ns
128 : 0.0 ns / 0.0 ns
256 : 0.0 ns / 0.0 ns
512 : 0.0 ns / 0.0 ns
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 6.2 ns / 10.8 ns
131072 : 9.7 ns / 15.1 ns
262144 : 13.3 ns / 19.6 ns
524288 : 113.8 ns / 178.6 ns
1048576 : 168.8 ns / 232.4 ns
2097152 : 203.6 ns / 258.0 ns
4194304 : 221.6 ns / 269.4 ns
8388608 : 233.2 ns / 278.2 ns
16777216 : 245.2 ns / 292.7 ns
33554432 : 263.4 ns / 325.1 ns
67108864 : 298.4 ns / 394.8 ns

The average latency of random memory read accesses inside a 512KB
buffer is significantly larger than the average latency for a 256KB
buffer. It means that there is either 256KB of L2 cache, or the CPU is
doing some clever partitioning of the cache, allowing each core to
allocate only half of the L2 cache lines (while allowing any core to
use any data that has already been pulled into the L2 cache).
However, for Allwinner A31 the same test shows a sharp latency increase
for buffer sizes bigger than 1MB, which confirms 1MB of shared L2 cache
in A31 (and matches the specs). All of the other experiments
(benchmarks done with two threads) also indicate that there is only
256KB of shared L2 cache in Allwinner A20.
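
A crude way to turn such a table into a cache size estimate automatically is
to look for the first large jump in latency between adjacent buffer sizes.
A quick sketch using the single random read column from above (the 2x
threshold is an arbitrary choice):

/* Sketch: estimate the last-level cache size from a (size, latency)
 * table by finding the first point where the latency jumps by more
 * than 2x. Numbers are the single random read column from above. */
#include <stdio.h>

int main(void)
{
    static const struct { long kb; double ns; } t[] = {
        {   64,   6.2 }, {  128,   9.7 }, {  256,  13.3 }, {  512, 113.8 },
        { 1024, 168.8 }, { 2048, 203.6 }, { 4096, 221.6 },
    };
    int i, n = sizeof(t) / sizeof(t[0]);

    for (i = 1; i < n; i++) {
        if (t[i].ns > 2.0 * t[i - 1].ns) {
            printf("latency jump between %ld KiB and %ld KiB => "
                   "roughly %ld KiB of cache\n",
                   t[i - 1].kb, t[i].kb, t[i - 1].kb);
            return 0;
        }
    }
    printf("no obvious jump found\n");
    return 0;
}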

However Cortex-A7 is still a bit better than Cortex-A8 in terms of the
effective cache size:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464f/BABJECJI.html
"Data is only allocated to the L2 cache when evicted from the L1 memory
system, not when first fetched from the system. The L1 cache can
prefetch data from the system without data being evicted from the L2
cache."
It looks like an exclusive cache architecture or some variation
of it. And the effective size for Cortex-A7 is the sum of L1 and L2
caches (32KB + 256KB).

In the case of Cortex-A8, cache misses pull data into both L1 and L2
(evicting some other useful data from the L2 cache to free space). Only
NEON can actually bypass the L1 cache and work directly with L2, which
makes it possible to keep non-duplicated data in L1 and L2 (pull data
into the L1 cache using ARM instructions and into the L2 cache using
NEON instructions).

Siarhei Siamashka

Jun 24, 2013, 10:44:26 PM
to linux...@googlegroups.com, johns9...@yahoo.co.uk
On Sun, 23 Jun 2013 17:44:26 +0100 (BST)
John S <johns9...@yahoo.co.uk> wrote:

> On the 1.2V / 1.4V: it would be better if we knew we could rely on
> the datasheet being accurate!  Failing that, Olimex and the like
> are maybe the best we have and so 1.4V looks OK.  Hopefully.

It appears that there is actually an A20 datasheet available here:

https://github.com/OLIMEX/OLINUXINO/tree/master/HARDWARE/A20-PDFs

Both absolute maximum and recommended VDD voltages are specified
to be 1.4V for Allwinner A20.

nils...@gmail.com

Dec 25, 2013, 7:28:32 AM
to linux...@googlegroups.com
On Sunday, June 23, 2013 at 0:33:06 UTC+4, Siarhei Siamashka wrote:
I have a Mele M3 (A20/1GB); it passed a gcc 4.7 compilation (3-stage bootstrap) on Gentoo with:
=== CPU clock frequency 1GHz (VDD=1.275V), DRAM 480 ===
+
vm.overcommit_memory = 2
vm.overcommit_ratio = 100

With the default 1.4V it segfaults.
Thanks for your post.

Oliver Schinagl

Dec 27, 2013, 5:56:11 AM
to linux...@googlegroups.com
On 12/25/13 13:28, nils...@gmail.com wrote:
> On Sunday, June 23, 2013 at 0:33:06 UTC+4, Siarhei Siamashka wrote:
Do you mean you are UNDERvolting it? that seems a little strange ...

> Thanks for you post.
>

nils...@gmail.com

Dec 27, 2013, 9:47:42 AM
to linux...@googlegroups.com, olive...@schinagl.nl
On Friday, December 27, 2013 at 14:56:11 UTC+4, Oliver Schinagl wrote:
Yes, I think that without a heatsink the compilation segfaults due to overheating.

1.008GHz/1.4V - segfaults.
1.008GHz/1.275V - gcc compiled OK, takes about 5 hours.
1.056GHz/1.325V - system freezes.
Back to 1.008GHz/1.275V - gcc compiled OK again.

Also I have a Mele A100 (A10/512MB) with a heatsink; gcc compilation always segfaults on it at any frequency, probably due to the overclocked (480MHz) RAM.

Michal Suchanek

Dec 27, 2013, 4:41:48 PM
to linux-sunxi, Oliver Schinagl
On 27 December 2013 15:47, <nils...@gmail.com> wrote:
> On Friday, December 27, 2013 at 14:56:11 UTC+4, Oliver Schinagl wrote:
>> On 12/25/13 13:28, nils....@gmail.com wrote:
>>
>> > On Sunday, June 23, 2013 at 0:33:06 UTC+4, Siarhei Siamashka wrote:
>>
>> >
>>
>> > With default 1.4V it segfaults.
>>
>> Do you mean you are UNDERvolting it? that seems a little strange ...
>>
>
> Yes, I think that without a heatsink the compilation segfaults due to overheating.
>
> 1.008GHz/1.4V - segfaults.
> 1.008GHz/1.275V - gcc compiled OK, takes about 5 hours.
> 1.056GHz/1.325V - system freezes.
> Back to 1.008GHz/1.275V - gcc compiled OK again.
>
> Also I have a Mele A100 (A10/512MB) with a heatsink; gcc compilation always segfaults on it at any frequency, probably due to the overclocked (480MHz) RAM.
>

It seems that AW is lacking in factory testing. It's not
surprising that different pieces of silicon turn out differently, but
they don't even have space for a part-specific voltage table and they
obviously don't test that the parts work according to spec. There
actually isn't any spec other than 'if it works for you then it's
within the spec'. They don't market parts with less than the maximum
frequency either. There is not even a certified frequency. When you tell
them it does not work at the frequency in the sales flyer, they tell
you to try and clock it lower...

ok, enough ranting. Obviously the chips would cost twice as much if
they had a spec and nobody would buy them at that price.

Thanks for sharing your results

Michal

Siarhei Siamashka

Jan 1, 2014, 6:11:31 PM
to linux...@googlegroups.com, nils...@gmail.com, olive...@schinagl.nl
It seems like you are either having extremely bad luck (2 bad devices
out of 2 is impressive), or something is really messed up on the
software side.

I have 4 devices with Allwinner-A10/Allwinner-A20. All of them are
working fine when used with the normal default configuration (the
unmodified linux sunxi-3.4 kernel, sunxi u-boot and fex files).

nils...@gmail.com

Jan 2, 2014, 8:35:03 AM
to linux...@googlegroups.com, nils...@gmail.com, olive...@schinagl.nl
On Thursday, January 2, 2014 at 3:11:31 UTC+4, Siarhei Siamashka wrote:
> On Fri, 27 Dec 2013 06:47:42 -0800 (PST)
>
Hardware without a box, like the Cubie, may perform better as it is better cooled in open air. I have tested the Mele M3 as is (in its box, and without a heatsink), so overheating comes quicker.
Yes, with the default 912MHz/1.4V it probably works OK, but I have not tried this.
For now, with 1.008GHz/1.275V and the "performance" governor, it is absolutely stable. I tried compiling gcc-gentoo-4.7.3-r1, gcc-linaro-4.6.4, gcc-linaro-4.7.4, sunxi-3.0, sunxi-3.4 and other code sources with no errors.
I ordered a 20x20x10mm heatsink; I think the A20 can do more when better cooled.

As for the A10 (Mele A100 with heatsink), my problem was the OVERvolting I used previously, e.g. 1.200GHz/1.65V.
I got it stable with 1.200GHz/1.5V. No more segfaults.
Tested with sunxi-3.4.75/Gentoo and sunxi-3.0.101/ICS.

Just for information, here are the stable minimal A10 freq/volt combinations I've found on my hardware:
{.freq = 1200000000, .volt = 1500},
{.freq = 1152000000, .volt = 1450},
{.freq = 1104000000, .volt = 1400},
{.freq = 1056000000, .volt = 1350},
{.freq = 1008000000, .volt = 1300},