On Mon, 24 Jun 2013 08:06:30 -0700 (PDT) Vasant wrote:
> On Sunday, June 23, 2013 1:39:18 PM UTC-7, Siarhei Siamashka wrote:
> >
> > There is one more interesting thing. Allwinner A20 seems to only have
> > 256K of L2 cache, which is shared between two Cortex-A7 cores. But if
> > it had 512K of shared L2 cache, then single-threaded workloads on
> > Allwinner A20 could have L2 cache size advantage over Allwinner A10
> > and mitigate the weaker core performance penalty. Too bad that this
> > has not happened.
>
> The A20 page on linux-sunxi indicates that the L2 cache is 512KB. Is there
> any way to check on the device, since the available documentation is
> non-existent? I agree a 256KB L2 cache will severely limit performance.
The L2 cache size can be measured by running various benchmarking
tools. I'm using https://github.com/ssvb/tinymembench for this purpose.
When run from the current default CubieBoard2 firmware image (but
with 1GHz clock frequency enabled via cpufreq), it reports the
following:
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with total 3 requests to SDRAM for almost every ==
== memory access (though 64MiB is not large enough to experience this ==
== effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : read access time (single random read / dual random read)
2 : 0.0 ns / 0.0 ns
4 : 0.0 ns / 0.0 ns
8 : 0.0 ns / 0.0 ns
16 : 0.0 ns / 0.0 ns
32 : 0.0 ns / 0.0 ns
64 : 0.0 ns / 0.0 ns
128 : 0.0 ns / 0.0 ns
256 : 0.0 ns / 0.0 ns
512 : 0.0 ns / 0.0 ns
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 6.2 ns / 10.8 ns
131072 : 9.7 ns / 15.1 ns
262144 : 13.3 ns / 19.6 ns
524288 : 113.8 ns / 178.6 ns
1048576 : 168.8 ns / 232.4 ns
2097152 : 203.6 ns / 258.0 ns
4194304 : 221.6 ns / 269.4 ns
8388608 : 233.2 ns / 278.2 ns
16777216 : 245.2 ns / 292.7 ns
33554432 : 263.4 ns / 325.1 ns
67108864 : 298.4 ns / 394.8 ns
The average latency of random memory reads inside a 512KB buffer is
significantly higher than for a 256KB buffer. This means that either
there is only 256KB of L2 cache, or the CPU is doing some clever
partitioning of the cache, allowing each core to allocate only half of
the L2 cache lines (while still letting any core use data that has
already been pulled into the L2 cache).
For Allwinner A31, however, the same test shows a sharp latency
increase for buffer sizes larger than 1MB, which confirms the 1MB of
shared L2 cache in A31 (and matches the specs). All of the other
experiments (benchmarks done with two threads) also indicate that
there is only 256KB of shared L2 cache in Allwinner A20.
Still, the Cortex-A7 is a bit better than the Cortex-A8 in terms of
effective cache size:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464f/BABJECJI.html
"Data is only allocated to the L2 cache when evicted from the L1 memory
system, not when first fetched from the system. The L1 cache can
prefetch data from the system without data being evicted from the L2
cache."
This looks like an exclusive cache architecture, or some variation of
it, so the effective cache size for Cortex-A7 is the sum of the L1 and
L2 caches (32KB + 256KB).
In the case of Cortex-A8, cache misses pull data into both L1 and L2
(evicting other potentially useful data from the L2 cache to free
space). Only NEON can actually bypass the L1 cache and work directly
with L2, which makes it possible to keep non-duplicated data in L1 and
L2 (pull data into the L1 cache using ARM instructions and into the L2
cache using NEON instructions).