DRAM controller performance in Allwinner A20


Siarhei Siamashka

Jun 30, 2013, 8:53:44 AM
to linux...@googlegroups.com, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com, Tom Cubie
Hello,

Right now the bootloader from the CubieBoard2 NAND firmware image
(cb_a20_ubn_12.04_x-v1.02-dram480.img) does not configure DRAMC in exactly
the same way as u-boot from https://github.com/hno/u-boot/tree/wip/a20

The a10-meminfo tool reports the following information for both:

dram_clk = 480
dram_type = 3
dram_rank_num = 1
dram_chip_density = 4096
dram_io_width = 16
dram_bus_width = 32
dram_cas = 9
dram_zq = 0x7f
dram_odt_en = 0
dram_tpr0 = 0x42d899b7
dram_tpr1 = 0xa090
dram_tpr2 = 0x22a00
dram_tpr3 = 0x0
dram_emr1 = 0x4
dram_emr2 = 0x10
dram_emr3 = 0x0

But there are some other differences if the whole set of DRAMC hardware
registers is dumped. For example, ccr (controller configuration
register) is set to 0x00004020 by the CubieBoard2 default firmware and
to 0x00004000 by hno's u-boot branch. This is not the only change,
there are a bunch of others.
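
As a small illustration of comparing two register dumps, the sketch below XORs the two CCR values quoted above and lists the bit positions that differ. The helper name reg_diff is made up for this example, and nothing here says what the differing bit actually does in the DRAM controller; it only shows that 0x00004020 and 0x00004000 differ in bit 5 alone.

```c
#include <stdint.h>
#include <stdio.h>

/* Print every bit position where two 32-bit register values differ and
 * return the XOR mask of the differing bits.
 * reg_diff is a hypothetical helper name, not part of any existing tool. */
static uint32_t reg_diff(const char *name, uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;

    for (int bit = 31; bit >= 0; bit--) {
        if (diff & (UINT32_C(1) << bit))
            printf("%s: bit %d differs (%u vs %u)\n", name, bit,
                   (unsigned)((a >> bit) & 1), (unsigned)((b >> bit) & 1));
    }
    return diff;
}
```

Calling reg_diff("ccr", 0x00004020, 0x00004000) returns the mask 0x20. The same helper could be run over every register of a full dump to narrow down which settings are worth investigating.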

The performance differences are visible if we run benchmarks. For
example, memset shows ~2GB/s performance with the CubieBoard2 firmware
and just ~1.6GB/s with https://github.com/hno/u-boot/tree/wip/a20

Are the sources of the bootloader from the CubieBoard2 available
somewhere (the part, which initializes the DRAM controller)?
Using a trial and error method to guess the right way to end up
with the same settings is not very appealing.

There is also one more potential performance problem. The latency of
DRAM accesses seems to be ~210ns for Allwinner A20. For comparison,
Allwinner A10 has ~165ns latency when using the same CAS9 timings,
or ~145ns with CAS6 timings. The latency of L2 cache is around or
slightly more than 10 cycles for both Cortex-A7 in Allwinner A20
and Cortex-A8 in Allwinner A10. So the source of the memory latency
increase seems to be somewhere on the way from the DDR3 chips to
the L2 cache.

Can anything be done to improve this situation? Thanks.

I also went ahead and added some Allwinner people to CC (the
participants of the "[linux-sunxi] [PATCHv2 0/8] clocksource:
sunxi: Timer fixes and cleanup" thread). I hope that they
don't mind, and possibly could help by sharing some missing
bits of information (as long as it does not involve disclosing
something confidential).

--
Best regards,
Siarhei Siamashka

Oliver Schinagl

Jun 30, 2013, 9:27:48 AM
to linux...@googlegroups.com, Siarhei Siamashka, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com, Tom Cubie
The memory controller is still far from well understood. A lot of things
happen through magic values. When luke ported a20 from boot0 to u-boot-spl
he overlooked/skipped a few things. These apparently are only
performance things.

https://github.com/oliv3r/u-boot-sunxi/tree/wip/a20

is my u-boot WIP tree (warning, very volatile) and should build and run
with all things we have from the boot0 tree that was released recently.
Of course many other possible optimizations are still unknown, as are a
few (not many) missing register (names). As such we know nothing of the
'zqcr[01]' register, very little of the dllcr registers, mcr misses bit
28's meaning and the first 8 bits, the drr register is a mystery (dram
refresh rate?), iocr reg and of course the tpr and emr (timing para?)

Which makes it very hard to support or improve :(

oliver

Siarhei Siamashka

Jun 30, 2013, 6:40:11 PM
to Siarhei Siamashka, linux...@googlegroups.com, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com, Tom Cubie
This particular problem turned out to be caused by wrong clock
frequency for MBUS (thanks to jemk on #linux-sunxi irc channel for
spotting this).

https://github.com/hno/u-boot/tree/wip/a20 uses the following code:

/* setup MBUS clock */
reg_val = (0x1 << 31) | (0x2 << 24) | (0x1);
writel(reg_val, &ccm->mbus_clk_cfg);

While https://github.com/hno/allwinner-boot/blob/lichee-a20-dev/boot0
does it as:

//setup MBUS clock
reg_val = (0x1U<<31) | (0x1<<24) | (0x1<<0) | (0x1<<16);
mctl_write_w(DRAM_CCM_MUS_CLK_REG, reg_val);

Which results in 240MHz vs. 300MHz MBUS clock speed (and 1.6GB/s vs.
2GB/s memset performance difference).
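
The arithmetic behind these two numbers can be sketched as follows. The field layout and the parent clock rates used here are assumptions picked because they reproduce the 240MHz and 300MHz figures above; they are not taken from a datasheet. The function name mbus_mhz and the ASSUMED_* macros are made up for this example.

```c
#include <stdint.h>

/* Assumed parent clock rates in MHz (hypothetical values for this sketch). */
#define ASSUMED_PLL5_MHZ    480u   /* assumption: PLL5 output equals dram_clk */
#define ASSUMED_PLL6X2_MHZ 1200u   /* assumption: PLL6 is 600MHz, used doubled */

/* Decode an MBUS clock register value under an ASSUMED field layout:
 *   bit  31     clock gate (1 = enabled)
 *   bits 25:24  clock source (1 = PLL6*2, 2 = PLL5)
 *   bits 17:16  N, pre-divider of 2^N
 *   bits  3:0   M, divider of M+1
 * Returns the MBUS clock in MHz, or 0 if the gate is off or the source
 * is not one of the two handled here. */
static unsigned mbus_mhz(uint32_t reg)
{
    unsigned src = (reg >> 24) & 0x3;
    unsigned n = (reg >> 16) & 0x3;
    unsigned m = reg & 0xf;
    unsigned parent;

    if (!(reg & (UINT32_C(1) << 31)))
        return 0;
    if (src == 1)
        parent = ASSUMED_PLL6X2_MHZ;
    else if (src == 2)
        parent = ASSUMED_PLL5_MHZ;
    else
        return 0;

    return parent / (1u << n) / (m + 1);
}
```

Under these assumptions the u-boot value (source 2, N=0, M=1) decodes to 480 / 2 = 240MHz, and the boot0 value (source 1, N=1, M=1) decodes to 1200 / 2 / 2 = 300MHz.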

Still, there are some differences in the other registers, and we
don't really know whether they are important or not.

Siarhei Siamashka

Jun 30, 2013, 7:13:16 PM
to Oliver Schinagl, linux...@googlegroups.com, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com, Tom Cubie
I see, thanks for the explanation. As I understand it now, there are a
bunch of different variants of boot0 sources on github (targeting
different Allwinner hardware and maybe originating from different code
drops).

It still would be nice to find the exact sources of the CubieBoard2
boot0 just to see if there is anything else important that we need to
take care of.

> https://github.com/oliv3r/u-boot-sunxi/tree/wip/a20
>
> is my u-boot WIP tree (warning, very volatile) and should build and run
> with all things we have from the boot0 tree that was released recently.
> Of course many other possible optimizations are still unknown, as are a
> few (not many) missing register (names). As such we know nothing of the
> 'zqcr[01]' register, very little of the dllcr registers, mcr misses bit
> 28's meaning and the first 8 bits, the drr register is a mystery (dram
> refresh rate?), iocr reg and of course the tpr and emr (timing para?)

I really appreciate your efforts and the efforts of the people porting
DRAMC initialization from boot0 to u-boot, despite all the difficulties.
But this activity also looks a bit dangerous to me. How can we be sure
that "all things we have from the boot0 tree that was released
recently" are really "all things" and nothing important has been
forgotten?

Was it not feasible to attempt minimizing the differences between
the original DRAMC initialization code from boot0 and its u-boot
counterpart? If this part of the hardware is poorly understood and
undocumented, why not take the original code mostly "as is"
without risky modification? I'm not really suggesting anything,
just curious.

> Which makes it very hard to support or improve :(

Yeah, looks pretty grim.

Oliver Schinagl

Jun 30, 2013, 7:38:39 PM
to linux...@googlegroups.com, Siarhei Siamashka, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com, Tom Cubie
We don't, we can only compare and see :(
>
> Was it not feasible to attempt minimizing the differences between
> the original DRAMC initialization code from boot0 and its u-boot
> counterpart? If this part of the hardware is poorly understood and
> undocumented, why not take the original code mostly "as is"
> without risky modification? I'm not really suggesting anything,
> just curious.
We haven't done any minimization steps yet, since we don't know what's
needed and what's not.

mbus and GPS are excellent examples.

Why does our sunxi-current set mbus for all SoCs (sun4i and sun5i) when
sun4i doesn't allow you to change those registers?

Or why does sun4i need the GPS to be set up for DRAM init?!?!

Many questions about the oddities of the dramc.

Siarhei Siamashka

Jul 9, 2013, 3:42:31 PM
to Oliver Schinagl, linux...@googlegroups.com, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com, Tom Cubie
The biggest question for me right now is the use of CAS Latency 9.

Shouldn't it be 6 for 400MHz and 7 for anything between 400MHz and
533MHz? The DRAM datasheet [1] says "The timing specification of high
speed bin is backward compatible with low speed bin". The CubieBoard2
is using the chips from the "-BG" speed bin (DDR3-1333-CL9). But if I
understand it correctly, they should be compatible with the timings
for "-BF" (DDR3-1066-CL7)?

The CAS latency alone does affect random memory access latency.
This affects the performance of lookups in large hash tables and
some other use cases. The performance difference is probably far
from ground breaking, but it would be a pity to waste it.
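
To put rough numbers on this, the CAS latency in clock cycles can be converted to nanoseconds. The sketch below only covers the CAS component and does not check any other DDR3 timing constraint; the function name cas_latency_ns is made up for this example.

```c
/* Convert a CAS latency given in DRAM clock cycles to nanoseconds.
 * dram_clk_mhz is the DRAM clock (not the data rate), e.g. 480 for the
 * dram_clk = 480 setting discussed in this thread. */
static double cas_latency_ns(unsigned cl_cycles, double dram_clk_mhz)
{
    return cl_cycles * 1000.0 / dram_clk_mhz;
}
```

At 480MHz this gives 18.75 ns for CL9 and about 14.6 ns for CL7, so the potential saving is roughly 4 ns per access, which is small next to the ~210ns total latency measured earlier but not zero.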

There is also CAS Write Latency (CWL), which probably needs to be
set right for reliable operation. But I hope somebody took care of
it already somewhere in the magic constants. And a bunch of other
options. The Altera paper about optimizing memory controllers [2]
seems to be quite interesting to read.

Anyway, it is not like I can do anything about this matter. I will
just wait a few more weeks, and then just consider whatever is used
by CubieBoard2 as the officially approved memory timings :-)


1. http://dl.linux-sunxi.org/chips/GT-DDR3-2Gbit-B-DIE-(X8,X16).pdf
2. http://www.altera.com/literature/hb/external-memory/emi_optimizing_efficiency.pdf

Henrik Nordström

Jul 14, 2013, 11:33:53 AM
to linux...@googlegroups.com, Siarhei Siamashka, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com, Tom Cubie
Mon 2013-07-01 at 01:40 +0300, Siarhei Siamashka wrote:

> Still, there are some differences in the other registers, and we
> don't really know whether they are important or not.

Can you map out the differences please?

Regards
Henrik

Siarhei Siamashka

Sep 19, 2013, 9:14:30 AM
to Siarhei Siamashka, linux...@googlegroups.com, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com
On Mon, 1 Jul 2013 01:40:11 +0300
It appears that in practice MBUS can be clocked much higher than that.
Some days ago Turl tried an extreme overclocking to 600MHz (possibly
because of assuming that the base clock was PLL6 and not PLL6*2). And
the system did not immediately fail:

http://irclog.whitequark.org/linux-sunxi/2013-09-14#4956716;

I also tried to run some tests for different MBUS clock frequencies on
my Cubieboard2. Here are the results for the memory write speed and the
latencies of random memory accesses inside of large memory blocks:

=== Allwinner A20, 1GHz, MBUS clock = 300 MHz, DRAM clock = 480MHz ===

NEON fill bandwidth = 2015.6 MB/s

block size : single random read / dual random read
262144 : 13.6 ns / 19.4 ns
524288 : 98.3 ns / 153.6 ns
1048576 : 145.0 ns / 199.9 ns
2097152 : 175.7 ns / 222.6 ns
4194304 : 191.3 ns / 232.0 ns
8388608 : 201.3 ns / 239.9 ns
16777216 : 211.6 ns / 252.8 ns
33554432 : 226.9 ns / 278.9 ns
67108864 : 254.5 ns / 333.2 ns

=== Allwinner A20, 1GHz, MBUS clock = 400 MHz, DRAM clock = 480MHz ===

NEON fill bandwidth = 2689.5 MB/s

block size : single random read / dual random read
262144 : 13.4 ns / 19.1 ns
524288 : 90.3 ns / 140.5 ns
1048576 : 133.2 ns / 182.1 ns
2097152 : 161.6 ns / 203.0 ns
4194304 : 176.4 ns / 211.7 ns
8388608 : 186.4 ns / 219.1 ns
16777216 : 196.0 ns / 231.7 ns
33554432 : 210.6 ns / 257.9 ns
67108864 : 235.3 ns / 306.6 ns

=== Allwinner A20, 1GHz, MBUS clock = 480 MHz, DRAM clock = 480MHz ===

NEON fill bandwidth = 3201.5 MB/s

block size : single random read / dual random read
262144 : 13.4 ns / 19.1 ns
524288 : 87.1 ns / 134.3 ns
1048576 : 127.7 ns / 173.1 ns
2097152 : 154.1 ns / 193.0 ns
4194304 : 168.6 ns / 201.5 ns
8388608 : 177.6 ns / 209.4 ns
16777216 : 186.9 ns / 222.1 ns
33554432 : 201.5 ns / 248.2 ns
67108864 : 226.0 ns / 295.1 ns

=== Allwinner A20, 1GHz, MBUS clock = 600 MHz, DRAM clock = 480MHz ===

NEON fill bandwidth = 3203.5 MB/s

block size : single random read / dual random read
262144 : 13.3 ns / 19.5 ns
524288 : 84.3 ns / 131.3 ns
1048576 : 123.6 ns / 169.2 ns
2097152 : 150.0 ns / 188.9 ns
4194304 : 164.1 ns / 197.4 ns
8388608 : 172.9 ns / 204.8 ns
16777216 : 182.3 ns / 218.1 ns
33554432 : 196.1 ns / 242.2 ns
67108864 : 219.0 ns / 287.0 ns
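
For readers unfamiliar with this kind of measurement, random read latency is commonly estimated by chasing pointers through a randomly permuted buffer, so that each load depends on the previous one. The sketch below is a simplified illustration of that general technique and is not the benchmark that produced the numbers above. All function names are made up, and a real tool would also pad each element to a cache line and use a better random source than rand().

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[] with a random single-cycle permutation (Sattolo's algorithm),
 * so that following next[] from any index visits every element exactly
 * once before returning to the start. */
static void build_chain(size_t *next, size_t n)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n; i > 1; i--) {
        size_t k = i - 1;                /* position being finalized */
        size_t j = (size_t)rand() % k;   /* 0 <= j < k, never k itself */
        size_t tmp = next[k];
        next[k] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for the given number of steps; every load depends on
 * the result of the previous one, so loads cannot overlap. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t idx = start;
    while (steps--)
        idx = next[idx];
    return idx;
}

/* Average time per dependent load in nanoseconds, based on CPU time. */
static double chase_ns_per_load(const size_t *next, size_t steps)
{
    clock_t t0 = clock();
    volatile size_t sink = chase(next, 0, steps);
    clock_t t1 = clock();
    (void)sink;   /* keep the compiler from discarding the loop */
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)steps;
}
```

With a buffer much larger than the L2 cache, chase_ns_per_load approaches the DRAM access latency; with a small buffer it reports cache latency instead, which is the pattern visible in the tables above as the block size grows.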

The system boots with 600MHz MBUS clock speed, but is not really
stable. So 600MHz is not something that anyone would ever want
to use. Still it is interesting that it does not cause immediate
failures on at least two A20 devices.

Other than improving the memory write bandwidth, the random
memory read latency gets better too (which is understandable,
because we need to transfer 64 bytes when doing cache line fills
on each read miss).
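
One way to read the fill bandwidth numbers is to divide them by the MBUS clock and compare them with the theoretical DDR peak. The sketch below does both. It assumes that MB/s means 10^6 bytes per second and uses the 32-bit bus width reported by a10-meminfo at the start of the thread; the function names are made up for this example.

```c
/* Bytes written per MBUS clock cycle: (MB/s) / MHz, assuming MB = 10^6 bytes. */
static double bytes_per_mbus_cycle(double fill_mb_per_s, double mbus_mhz)
{
    return fill_mb_per_s / mbus_mhz;
}

/* Theoretical peak DDR bandwidth in MB/s: two transfers per DRAM clock,
 * each transfer moving bus_width_bits / 8 bytes. */
static double ddr_peak_mb_per_s(double dram_clk_mhz, unsigned bus_width_bits)
{
    return dram_clk_mhz * 2.0 * (bus_width_bits / 8.0);
}
```

With the measurements above, the fill rate stays close to 6.7 bytes per MBUS cycle at 300, 400 and 480MHz and drops to about 5.3 at 600MHz. The theoretical peak for a 32-bit bus at 480MHz is 3840 MB/s, so the ~3200 MB/s plateau is about 83% of it. This suggests, but does not prove, that MBUS is the bottleneck up to 480MHz and something else limits throughput beyond that.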

The comments in the kernel sources seem to imply that 400MHz
MBUS clock is supported for Allwinner-A20:
https://github.com/linux-sunxi/linux-sunxi/blob/sunxi-v3.4.61-r0/arch/arm/mach-sun7i/clock/mod_clk.c#L2076

So should we bump the MBUS clock frequency to 400MHz in u-boot?
Or maybe it's better to clock MBUS at the same speed as DRAM?
Some vendors prefer to set DRAM to 384MHz in A20 devices by
default:
http://irclog.whitequark.org/linux-sunxi/2013-07-29#4520613;
DRAM is often overclockable to 480MHz, but MBUS also seems to be
able to run at this speed.
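
Which MBUS frequencies are reachable depends on the parent clock and the available integer dividers. The sketch below picks the smallest divider that keeps the result at or below a target. The 1200MHz parent (PLL6*2) used in the note that follows and the maximum divider of 16 are assumptions for illustration only, and best_divider is a made-up name.

```c
/* Find the smallest integer divider d in 1..max_div such that
 * parent_mhz / d does not exceed target_mhz. On success, returns d and
 * stores the resulting frequency (rounded down to whole MHz) in
 * *actual_mhz if it is not NULL. Returns 0 if no divider qualifies. */
static unsigned best_divider(unsigned parent_mhz, unsigned target_mhz,
                             unsigned max_div, unsigned *actual_mhz)
{
    for (unsigned d = 1; d <= max_div; d++) {
        /* Compare without division so truncation cannot hide an overshoot. */
        if (parent_mhz <= target_mhz * d) {
            if (actual_mhz)
                *actual_mhz = parent_mhz / d;
            return d;
        }
    }
    return 0;
}
```

With an assumed 1200MHz parent, targets of 300MHz and 400MHz are hit exactly (dividers 4 and 3), while a 480MHz target falls back to 400MHz because 1200 / 2 = 600 overshoots. Matching a 480MHz DRAM clock exactly would therefore need a different parent clock.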

Faster memory speed usually works better when driving 1920x1080
monitors (at least based on the experience dealing with Allwinner A10).
Every little bit helps.

PS. Allwinner-A10 does not have the MBUS configuration register. So
presumably the MBUS clock configuration is hardcoded for A10 (assuming
that it even has MBUS).

Michal Suchanek

Sep 19, 2013, 9:39:47 AM
to linux-sunxi, Siarhei Siamashka, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com
Hello,

On 19 September 2013 15:14, Siarhei Siamashka
Maybe go for same MBUS clock as DRAM clock by default?

Then you get fast default on devices that are overclocked anyway and
safe default on devices where vendors want to play it safe with lower
DRAM clocks.

Thanks

Michal

Hans de Goede

Sep 19, 2013, 10:38:54 AM
to linux...@googlegroups.com, Michal Suchanek, Siarhei Siamashka, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com
Hi,
Yes, that sounds like a good idea to me too, +1

Regards,

Hans

Clement Wong

Sep 21, 2013, 5:29:09 PM
to linux...@googlegroups.com, Michal Suchanek, Siarhei Siamashka, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com
Sounds great! +1

BR,
Clement

Siarhei Siamashka

Sep 22, 2013, 5:41:08 AM
to Clement Wong, linux...@googlegroups.com, Michal Suchanek, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com
> Sounds great! +1

BTW, I'm still giving it a kind of test drive on my Cubieboard2 just
to be sure that it does not regress anything.

The DRAMC and MBUS in my Cubieboard2 can be both clocked up to 552MHz,
still boot and pass basic tests (p7zip benchmark run as "7z b", which
does compression/decompression and also verification of correctness).
This does not in any way mean that I suggest using 552MHz memory clock
frequency on Cubieboard2. I'm just trying to check if there is still
any safety headroom at 480MHz.

But I have only a single device, which makes the results not very
representative. And it's most interesting to figure out why ~20%
of Allwinner A20 chips fail when the memory is clocked at higher
than 400MHz.

http://irclog.whitequark.org/linux-sunxi/2013-07-29#4520613;

I wonder if clocking MBUS from PLL5 instead of PLL6 could improve
reliability and allow DRAMC in these problematic A20 devices to
also go up to 480MHz? Could any "lucky" owners of such devices
give it a try?

A simple patch for sunxi u-boot to clock MBUS from PLL5 and run at
the same clock frequency as DRAMC:

---

diff --git a/arch/arm/cpu/armv7/sunxi/dram.c b/arch/arm/cpu/armv7/sunxi/dram.c
index d7b2fe8..a56e705 100644
--- a/arch/arm/cpu/armv7/sunxi/dram.c
+++ b/arch/arm/cpu/armv7/sunxi/dram.c
@@ -234,12 +234,11 @@ static void mctl_setup_dram_clock(u32 clk)
/* setup MBUS clock */
reg_val = CCM_MBUS_CTRL_GATE |
#ifdef CONFIG_SUN7I
- CCM_MBUS_CTRL_CLK_SRC(CCM_MBUS_CTRL_CLK_SRC_PLL6) |
- CCM_MBUS_CTRL_N(CCM_MBUS_CTRL_N_X(2)) |
+ CCM_MBUS_CTRL_CLK_SRC(CCM_MBUS_CTRL_CLK_SRC_PLL5);
#else
CCM_MBUS_CTRL_CLK_SRC(CCM_MBUS_CTRL_CLK_SRC_PLL5) |
-#endif
CCM_MBUS_CTRL_M(CCM_MBUS_CTRL_M_X(2));
+#endif
writel(reg_val, &ccm->mbus_clk_cfg);

/*

Clement Wong

Oct 18, 2013, 10:33:18 AM
to Siarhei Siamashka, linux...@googlegroups.com, Michal Suchanek, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com
Hi guys,

Sorry for not being able to help. I’ve tried to test it but was stuck on another issue, which is fixed now, and I haven’t had another chance to try it again. I’m on a business trip in Finland and will soon be going to China; I’ll be able to test it on 20 devices when I get back to Norway in about 3 weeks.
The performance looks promising from your test, so it would be really nice if we could get more people to test it out, and even better if it could get merged to at least staging soon (at 480MHz?).

BR,
Clement

Siarhei Siamashka

Oct 18, 2013, 2:10:03 PM
to Clement Wong, linux...@googlegroups.com, Michal Suchanek, ke...@allwinnertech.com, su...@allwinnertech.com, sh...@allwinnertech.com
On Fri, 18 Oct 2013 17:33:18 +0300
> Hi guys,
>
> Sorry for not been able to help, I’ve tried to test it but was stuck on
> another issue which is fixed now, but haven’t got another chance to try
> it again. I’m on a business trip in Finland and soon going to China,
> then I’ll be able to test it on 20 devices when I get back to Norway
> after 3 weeks from now.
> The performance looks promising from your test, so it would be really nice
> if we get more people to test it out, even better if it could get
> merged to at least staging soon (at 480MHz?).

Just to make it clear. There are different groups of people potentially
participating here:

1) The Allwinner representatives, who have designed the SoC and (at
least in theory) should know all its specs.
2) Board manufacturers (CubieTech and OLIMEX), who can potentially run
tests on a large number of devices. Though they might not have
time and resources to do this.
3) Ordinary end users (such as myself), who have only one or two
boards.

I tried to run some performance and stability tests on the hardware
that I have and post the results. But it's not safe to assume that all
A20 chips would behave the same and tweak the defaults.

When running at a high MBUS clock (480MHz, same as dram), I have
observed stability problems with rootfs on an SD card. And because
I'm primarily using NFS, I could not spot this issue earlier. Also
nobody has provided any test results from their hardware so far, so
we are still very much in uncharted waters :-)

Now the question is whether the 400MHz MBUS clock is safe on A20, or
we should just stay at 300MHz.