[PATCH sunxi-tools 0/2] fel: Faster FEL USB boot for Allwinner H3

506 views
Skip to first unread message

Siarhei Siamashka

unread,
Nov 10, 2015, 7:41:24 PM11/10/15
to linux...@googlegroups.com, Bernhard Nortmann, Jens Kuske, Vishnu Patekar
When booting my Orange Pi PC board over USB via FEL, this makes
the kernel image transfer speed actually tolerable instead of
being annoyingly slow.

Tested with Jens Kuske's current work-in-progress U-Boot branch for H3:
https://github.com/jemk/u-boot-sunxi/commits/sunxi/h3

A similar improvement can be also easily done for A33 and A83T.

Siarhei Siamashka (2):
fel: Support for enabling MMU after running SPL on new SoC variants
fel: Enable MMU on Allwinner-H3

fel.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

--
2.4.10

Siarhei Siamashka

unread,
Nov 10, 2015, 7:41:25 PM11/10/15
to linux...@googlegroups.com, Bernhard Nortmann, Jens Kuske, Vishnu Patekar
When the CPU clock speed is set to 480 MHz by the U-Boot SPL,
the performance improvement for 'sunxi-fel write' transfers
to DRAM is ~95 KB/s -> ~510 KB/s.

When the CPU clock speed is set to 1008 MHz by the U-Boot SPL,
the performance improvement for 'sunxi-fel write' transfers
to DRAM is ~180 KB/s -> ~510 KB/s.

This means that the memcpy step done by the CPU is not a
bottleneck anymore and the transfer speed is limited by
something else.

Signed-off-by: Siarhei Siamashka <siarhei....@gmail.com>
---
fel.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fel.c b/fel.c
index 916afdf..c33580e 100644
--- a/fel.c
+++ b/fel.c
@@ -487,6 +487,7 @@ soc_sram_info soc_sram_info_table[] = {
{
.soc_id = 0x1680, /* Allwinner H3 */
.scratch_addr = 0x2000,
+ .mmu_tt_addr = 0x44000,
.thunk_addr = 0x46E00, .thunk_size = 0x200,
.swap_buffers = a31_sram_swap_buffers,
},
--
2.4.10

Siarhei Siamashka

unread,
Nov 10, 2015, 7:41:25 PM11/10/15
to linux...@googlegroups.com, Bernhard Nortmann, Jens Kuske, Vishnu Patekar
The BROM in newer SoC variants doesn't enable MMU by default anymore.
So in order to benefit from e4b3da2b17ee9d7c5cab9b80e708b3a309fc4c96
("fel: Faster USB transfers via 'fel write' to DRAM"), we need to
be able to enable it from the 'sunxi-fel' tool.

Signed-off-by: Siarhei Siamashka <siarhei....@gmail.com>
---
fel.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)

diff --git a/fel.c b/fel.c
index e9f6450..916afdf 100644
--- a/fel.c
+++ b/fel.c
@@ -388,6 +388,18 @@ typedef struct {
* Note: the entries in the 'swap_buffers' tables need to be sorted by 'buf1'
* addresses. And the 'buf1' addresses are the BROM data buffers, while 'buf2'
* addresses are the intended backup locations.
+ *
+ * Also for performance reasons, we want to have MMU enabled with optimal
+ * section attributes configured (the code from the BROM should use I-cache,
+ * writing data to the DRAM area should use write combining). The reason is
+ * that the BROM FEL protocol implementation relies on a memcpy loop somewhere
+ * on the perfromance critical path when transferring data over USB. The older
+ * SoC variants (A10/A13/A20/A31/A23) already have MMU enabled and we only need
+ * to adjust section attributes. The BROM in newer SoC variants (A33/A83T/H3)
+ * doesn't enable MMU anymore, so we need to find some 16K of spare space in
+ * SRAM to place the translation table there and specify it as the 'mmu_tt_addr'
+ * field in the 'soc_sram_info' structure. This is only an optional performance
+ * feature.
*/
typedef struct {
uint32_t soc_id; /* ID of the SoC */
@@ -396,6 +408,7 @@ typedef struct {
uint32_t thunk_addr; /* Address of the thunk code */
uint32_t thunk_size; /* Maximal size of the thunk code */
uint32_t needs_l2en; /* Set the L2EN bit */
+ uint32_t mmu_tt_addr; /* MMU translation table address */
sram_swap_buffers *swap_buffers;
} soc_sram_info;

@@ -600,6 +613,21 @@ uint32_t aw_get_ttbr0(libusb_device_handle *usb, soc_sram_info *sram_info)
return ttbr0;
}

+void aw_set_ttbr0(libusb_device_handle *usb, soc_sram_info *sram_info,
+ uint32_t ttbr0)
+{
+ uint32_t arm_code[] = {
+ htole32(0xe59f200c), /* ldr r2, [pc, #12] */
+ htole32(0xee022f10), /* mcr 15, 0, r2, cr2, cr0, {0} */
+ htole32(0xe12fff1e), /* bx lr */
+ };
+
+ ttbr0 = htole32(ttbr0);
+ aw_fel_write(usb, arm_code, sram_info->scratch_addr, sizeof(arm_code));
+ aw_fel_write(usb, &ttbr0, sram_info->scratch_addr + 0x14, sizeof(ttbr0));
+ aw_fel_execute(usb, sram_info->scratch_addr);
+}
+
uint32_t aw_get_sctlr(libusb_device_handle *usb, soc_sram_info *sram_info)
{
uint32_t sctlr = 0;
@@ -616,6 +644,20 @@ uint32_t aw_get_sctlr(libusb_device_handle *usb, soc_sram_info *sram_info)
return sctlr;
}

+uint32_t *aw_generate_mmu_translation_table(void)
+{
+ uint32_t *tt = malloc(16 * 1024);
+ uint32_t i;
+
+ /* The same MMU translation table as used by the A20 BROM */
+ for (i = 0; i < 4096; i++)
+ tt[i] = 0x00000DE2 | (i << 20);
+ tt[0x000] |= 0x1000;
+ tt[0xFFF] |= 0x1000;
+
+ return tt;
+}
+
uint32_t *aw_backup_and_disable_mmu(libusb_device_handle *usb,
soc_sram_info *sram_info)
{
@@ -786,6 +828,12 @@ void aw_fel_write_and_execute_spl(libusb_device_handle *usb,
pr_info("Stack pointers: sp_irq=0x%08X, sp=0x%08X\n", sp_irq, sp);

tt = aw_backup_and_disable_mmu(usb, sram_info);
+ if (!tt && sram_info->mmu_tt_addr) {
+ pr_info("Generating the new MMU translation table at 0x%08X\n",
+ sram_info->mmu_tt_addr);
+ tt = aw_generate_mmu_translation_table();
+ aw_set_ttbr0(usb, sram_info, sram_info->mmu_tt_addr);
+ }

swap_buffers = sram_info->swap_buffers;
for (i = 0; swap_buffers[i].size; i++) {
--
2.4.10

Stefan Monnier

unread,
Nov 11, 2015, 8:20:35 AM11/11/15
to linux...@googlegroups.com
> When the CPU clock speed is set to 480 MHz by the U-Boot SPL,
^^^
You mean MBUS?

> the performance improvement for 'sunxi-fel write' transfers
> to DRAM is ~95 KB/s -> ~510 KB/s.

> When the CPU clock speed is set to 1008 MHz by the U-Boot SPL,
> the performance improvement for 'sunxi-fel write' transfers
> to DRAM is ~180 KB/s -> ~510 KB/s.

I don't understand the relation between the above 2 improvements.
What change causes the speed to go from 95KB/s to 180KB/s?


Stefan

Bernhard Nortmann

unread,
Nov 11, 2015, 10:26:58 AM11/11/15
to Siarhei Siamashka, linux...@googlegroups.com
> + * that the BROM FEL protocol implementation relies on a memcpy loop somewhere
> + * on the perfromance critical path when transferring data over USB. The older
> + * SoC variants (A10/A13/A20/A31/A23) already have MMU enabled and we only need

nitpick: you have a typo in there - s/perfromance/performance/

Regards, B. Nortmann

Peter Korsgaard

unread,
Nov 11, 2015, 5:03:31 PM11/11/15
to Siarhei Siamashka, linux...@googlegroups.com, Bernhard Nortmann, Jens Kuske, Vishnu Patekar
>>>>> "Siarhei" == Siarhei Siamashka <siarhei....@gmail.com> writes:

> The BROM in newer SoC variants doesn't enable MMU by default anymore.
> So in order to benefit from e4b3da2b17ee9d7c5cab9b80e708b3a309fc4c96
> ("fel: Faster USB transfers via 'fel write' to DRAM"), we need to
> be able to enable it from the 'sunxi-fel' tool.

> Signed-off-by: Siarhei Siamashka <siarhei....@gmail.com>
> ---
> fel.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 48 insertions(+)

> diff --git a/fel.c b/fel.c
> index e9f6450..916afdf 100644
> --- a/fel.c
> +++ b/fel.c
> @@ -388,6 +388,18 @@ typedef struct {
> * Note: the entries in the 'swap_buffers' tables need to be sorted by 'buf1'
> * addresses. And the 'buf1' addresses are the BROM data buffers, while 'buf2'
> * addresses are the intended backup locations.
> + *
> + * Also for performance reasons, we want to have MMU enabled with optimal
> + * section attributes configured (the code from the BROM should use I-cache,
> + * writing data to the DRAM area should use write combining). The reason is
> + * that the BROM FEL protocol implementation relies on a memcpy loop somewhere
> + * on the perfromance critical path when transferring data over USB. The older

s/perfromance/performance/

> typedef struct {
> uint32_t soc_id; /* ID of the SoC */
> @@ -396,6 +408,7 @@ typedef struct {
> uint32_t thunk_addr; /* Address of the thunk code */
> uint32_t thunk_size; /* Maximal size of the thunk code */
> uint32_t needs_l2en; /* Set the L2EN bit */
> + uint32_t mmu_tt_addr; /* MMU translation table address */

Maybe we should add a comment that this address has to be 16KB aligned?


> +void aw_set_ttbr0(libusb_device_handle *usb, soc_sram_info *sram_info,
> + uint32_t ttbr0)
> +{
> + uint32_t arm_code[] = {
> + htole32(0xe59f200c), /* ldr r2, [pc, #12] */
> + htole32(0xee022f10), /* mcr 15, 0, r2, cr2, cr0, {0} */
> + htole32(0xe12fff1e), /* bx lr */
> + };
> +
> + ttbr0 = htole32(ttbr0);
> + aw_fel_write(usb, arm_code, sram_info->scratch_addr, sizeof(arm_code));
> + aw_fel_write(usb, &ttbr0, sram_info->scratch_addr + 0x14, sizeof(ttbr0));
> + aw_fel_execute(usb, sram_info->scratch_addr);

I know this is matching the other functions, but why do we actually
stick the value 8 bytes after the code? Something like this would imho
be simpler and easier to understand:

uint32_t arm_code[] = {
htole32(0xe59f2004), /* ldr r2, [pc, #4] */
htole32(0xee022f10), /* mcr 15, 0, r2, cr2, cr0, {0} */
htole32(0xe12fff1e), /* bx lr */
0, /* ttbr0 goes here */
};

arm_code[3] = htole32(ttbr0);
aw_fel_write(usb, arm_code, sram_info->scratch_addr, sizeof(arm_code));
aw_fel_execute(usb, sram_info->scratch_addr);

> +uint32_t *aw_generate_mmu_translation_table(void)
> +{
> + uint32_t *tt = malloc(16 * 1024);

4096 * sizeof(uint32_t) would imho be clearer.

> + uint32_t i;
> +
> + /* The same MMU translation table as used by the A20 BROM */
> + for (i = 0; i < 4096; i++)
> + tt[i] = 0x00000DE2 | (i << 20);

A comment like /* 1MB identity mapping */ would be good.

> + tt[0x000] |= 0x1000;

A /* TEX */ comment would be good.

> + tt[0xFFF] |= 0x1000;

As we override this in aw_restore_and_enable_mmu() I guess this isn't
strictly needed?

Other than that it looks good to me, thanks!

--
Bye, Peter Korsgaard

Julian Calaby

unread,
Nov 12, 2015, 6:03:42 AM11/12/15
to mon...@iro.umontreal.ca, linux-sunxi
Hi Stefan,
My interpretation of this was that before this change, the transfer
speed was somehow limited by the clock speed. With the MMU enabled, it
appears that the clock speed is no-longer the limiting factor.

(I'm _guessing_ that the transfer has to bounce through the CPU and
this was slow without the MMU for some reason. I'm also guessing that
enabling the MMU has moved the bottleneck elsewhere - possibly the USB
hardware.)

Thanks,

--
Julian Calaby

Email: julian...@gmail.com
Profile: http://www.google.com/profiles/julian.calaby/

Vishnu Patekar

unread,
Nov 12, 2015, 12:13:06 PM11/12/15
to julian...@gmail.com, mon...@iro.umontreal.ca, linux-sunxi
Hello,

On A83T board, after enabling MMU, sunxi-fel write speed is similar 510KB/s (peripheral clock is at  600MHz).

If I increase the PLL6(peripheral clock), performance is improving.

after changing the PLL6 to 1200MHz, transfer speed to DDR3 is : 920.2 KB/s

Though, it is mentioned in datasheet that it's value should be 600. However, allwinner u-boot code make it to 1200MHz.

 

Thanks,

--
Julian Calaby

Email: julian...@gmail.com
Profile: http://www.google.com/profiles/julian.calaby/

--
You received this message because you are subscribed to the Google Groups "linux-sunxi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to linux-sunxi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Siarhei Siamashka

unread,
Nov 13, 2015, 4:06:06 AM11/13/15
to Vishnu Patekar, julian...@gmail.com, mon...@iro.umontreal.ca, linux-sunxi
Thanks! That's an interesting discovery. Basically, by tweaking the
PLL6 clock frequency you bring the FEL transfer speed to the A20 level:

https://linux-sunxi.org/FEL/USBBoot#SoC_support_status

It was always a bit strange why A20 is almost twice faster than
everything else. Especially compared to A10/A13, which are sharing
the same clock setup code with A20:

https://github.com/u-boot/u-boot/blob/v2015.10/arch/arm/cpu/armv7/sunxi/clock_sun4i.c

Maybe after doing some experiments, we can pick better USB related
clock settings in U-Boot for all SoC variants :-)


Regarding the A83T SoC. The A82T user manual has quite a nice clock
tree diagrams, which show that USB OTG is using the AHB1 BUS. So it's
probably not just the PLL6 clock speed, but more like the AHB1 divisor
that is responsible for the USB FEL transfer speed increase.

By looking at your recent A83T clock support patches for U-Boot

https://patchwork.ozlabs.org/patch/543587/

we can find

#define PLL6_CFG_DEFAULT 0x80041800 /* 576 MHz */
#define AHB1_ABP1_DIV_DEFAULT 0x00002190

And AHB1_ABP1_DIV_DEFAULT can be decoded as:

AHB1_CLK_DIV_RATIO: 01: /2
AHB1_PRE_DIV: 10: /3
APB1_CLK_RATIO: 01: /2
AHB1_CLK_SRC_SEL: 1X: PLL_PERIPH/ AHB1_PRE_DIV.

Do I understand it correctly that you are currently setting AHB1
clock speed to 576 /2 /3 = 96 MHz ?

The "7.2.5.2. SPI Module Clock Source and Frequency" section of the
A83T manual says that "The SPI_SCLK can in the range from 3Khz to 100
MHZ" and also says that there is a requirement "AHB_CLK >= 2xSPI_SCLK".
Does this effectively mean that we are supposed to clock AHB1 with at
least 200MHz ?

--
Best regards,
Siarhei Siamashka

Siarhei Siamashka

unread,
Nov 13, 2015, 6:52:36 AM11/13/15
to Vishnu Patekar, julian...@gmail.com, mon...@iro.umontreal.ca, linux-sunxi
I can confirm that doubling the AHB1 clock speed by adjusting divisors
on Allwinner H3 can also increase the FEL transfer speed to ~900 KB/s.

Now we just need to know what is the safe AHB clock speed limit for
all the SoC variants, then increase it in U-Boot and profit?

We are still very far from the DFU transfers speeds. But there is
one more trick that still can be tried. We can copy the BROM code
to DRAM and then use MMU to map this copy to the 0xFFF00000 virtual
address, thus substituting the BROM. This code in DRAM can be patched
if necessary. Of course, this only makes sense if the patch is really
small (for example, just replacing something like a memcpy function
with a possibly better implementation).

Siarhei Siamashka

unread,
Nov 15, 2015, 3:31:15 PM11/15/15
to Peter Korsgaard, linux...@googlegroups.com, Bernhard Nortmann, Jens Kuske, Vishnu Patekar
Sure. We can even add an error check it in the code in order to be
more friendly to the people who do not read comments carefully ;-)

> > +void aw_set_ttbr0(libusb_device_handle *usb, soc_sram_info *sram_info,
> > + uint32_t ttbr0)
> > +{
> > + uint32_t arm_code[] = {
> > + htole32(0xe59f200c), /* ldr r2, [pc, #12] */
> > + htole32(0xee022f10), /* mcr 15, 0, r2, cr2, cr0, {0} */
> > + htole32(0xe12fff1e), /* bx lr */
> > + };
> > +
> > + ttbr0 = htole32(ttbr0);
> > + aw_fel_write(usb, arm_code, sram_info->scratch_addr, sizeof(arm_code));
> > + aw_fel_write(usb, &ttbr0, sram_info->scratch_addr + 0x14, sizeof(ttbr0));
> > + aw_fel_execute(usb, sram_info->scratch_addr);
>
> I know this is matching the other functions, but why do we actually
> stick the value 8 bytes after the code? Something like this would imho
> be simpler and easier to understand:
>
> uint32_t arm_code[] = {
> htole32(0xe59f2004), /* ldr r2, [pc, #4] */
> htole32(0xee022f10), /* mcr 15, 0, r2, cr2, cr0, {0} */
> htole32(0xe12fff1e), /* bx lr */
> 0, /* ttbr0 goes here */
> };
>
> arm_code[3] = htole32(ttbr0);
> aw_fel_write(usb, arm_code, sram_info->scratch_addr, sizeof(arm_code));
> aw_fel_execute(usb, sram_info->scratch_addr);

Yes, your suggestion looks good. I used to assign an arbitrary fixed
offset for such data because this allowed me to comment out or add
instructions easier. But in this particular, case we can indeed make the
code shorter and do one less aw_fel_write(). That's a good catch. I'll
address this in the v2 patch.

> > +uint32_t *aw_generate_mmu_translation_table(void)
> > +{
> > + uint32_t *tt = malloc(16 * 1024);
>
> 4096 * sizeof(uint32_t) would imho be clearer.

Then it is also better to do the same in aw_backup_and_disable_mmu()
function for the sake of consistency. This is just a cosmetic change
and only matter of whether we want to highlight the fact that the
size of the translation table is 16K or want to be consistent with
the loop code that follows below.

> > + uint32_t i;
> > +
> > + /* The same MMU translation table as used by the A20 BROM */
> > + for (i = 0; i < 4096; i++)
> > + tt[i] = 0x00000DE2 | (i << 20);
>
> A comment like /* 1MB identity mapping */ would be good.
>
> > + tt[0x000] |= 0x1000;
>
> A /* TEX */ comment would be good.

OK.

> > + tt[0xFFF] |= 0x1000;
>
> As we override this in aw_restore_and_enable_mmu() I guess this isn't
> strictly needed?

Yes, it is not strictly needed. My approach was to just borrow the
existing translation table from another SoC (which uses roughly the
same memory map) and bring it back to H3 in order to keep our
adjustments minimal.

There are SoC variants with radically different memory map, such as
A80 and A64. The current aw_restore_and_enable_mmu() implementation
is not a good fit for them but I don't have the hardware yet.
Instead of backing up the MMU setup and then restoring it with
some adjustments, it is probably better to just disable the
MMU and then create it from scratch without bothering about how
it was configured before (do it simply based on the SoC memory map).

In fact relying too much on having the pre-existing MMU setup was
the reason of an annoying bug in this implementation. The DACR
register is not initialized on H3 and contains random garbage.
This random garbage happens to be somehow workable on the cold
start by pure luck, but then still causes a deadlock after the
first reboot. I have a fix for this problem in the v2 patch.

> Other than that it looks good to me, thanks!

Thanks for the review!

Siarhei Siamashka

unread,
Nov 15, 2015, 3:51:06 PM11/15/15
to Vishnu Patekar
On Sat, 14 Nov 2015 04:36:16 +0800
Vishnu Patekar <vishnupa...@gmail.com> wrote:

> Hello,
>
> On Fri, Nov 13, 2015 at 7:52 PM, Siarhei Siamashka <
> Yes, I've also verified that xfer speed can go upto ~910KB/s after doubling
> the AHB1 clock.
> below are the results on A83T and A33.
>
> ============A33===================
> 1. CPU Speed 480MHz, MMU Disbaled speed: 85.3 KB/s
> 2. CPU Speed 480MHz, MMU Enabled speed: 511.7 KB/s
> 3. CPU speed 1003MHz, MMU Disabled speed: 184.5 KB/s
> 4. CPU Speed 1003MHz, MMU enabled speed: 511.9 KB/s
> 5. CPU speed 1003MHz, MMU enabled,
> AHB1 freq doubled speed: 971.7 KB/s
>
>
> ============A83t===================
> 1. CPU Speed 480MHz, MMU Disbaled : 112.4 KB/s
> 2. CPU Speed 480MHz, MMU Enabled : 510.7 KB/s
> 3. CPU speed 1003MHz, MMU Disabled : 112.6 KB/s
> 4. CPU Speed 1003MHz, MMU enabled : 510.8 KB/s
> 5. CPU speed 1003MHz, MMU enabled,
> AHB1 freq doubled : 907.7 KB/s

Hello,

Thanks for running these performance tests. They indeed look very
promising. With such transfer speeds, the FEL boot time becomes
less than 5 seconds for the typical size of the mainline kernel.

It is very clear that we probably want to have the highest valid AHB
clock speed configured by U-Boot. It does affect FEL transfer speeds
and maybe also could affect the performance of the other hardware
blocks connected to the AHB bus (for example, the crypto accelerator
can be benchmarked too).

But this should be really discussed in the U-Boot mailing list. As
for the sunxi-tools patch in question, I will just amend the commit
messaage to mention that the AHB clock speed becomes the next
bottleneck.

Chen-Yu Tsai

unread,
Nov 15, 2015, 9:30:00 PM11/15/15
to Siarhei Siamashka, Vishnu Patekar, linux-sunxi, Hans De Goede
Hi,

On Mon, Nov 16, 2015 at 4:50 AM, Siarhei Siamashka
On the A10s/A13/A20/A31/A31s we have AHB clocked from PLL6, at 150 MHz.
This is done in the dts file. For the A31, this is out of necessity, as
the DMA engine has this weird requirement. For the other chips, we wanted
the HSTIMER to be stable even if cpufreq changes the cpu speed.

On the A23/A33 we haven't done so due to some issues with the clock driver.
The default pre-divider is too low, which results in the AHB clock rate
going up to 300 MHz and hanging.

The safest top speed is around 200 ~ 250 MHz. IIRC one of the older
manuals actually has the operating range for the various bus clocks.

In any case, I'm for adding these settings to U-boot.


Regards
ChenYu

> But this should be really discussed in the U-Boot mailing list. As
> for the sunxi-tools patch in question, I will just amend the commit
> messaage to mention that the AHB clock speed becomes the next
> bottleneck.
>
> --
> Best regards,
> Siarhei Siamashka
>

Siarhei Siamashka

unread,
Nov 16, 2015, 8:29:45 PM11/16/15
to linux...@googlegroups.com, Bernhard Nortmann, Jens Kuske, Vishnu Patekar, Peter Korsgaard
Changes in v2:
* The DACR register is now properly initialized when enabling the MMU.
In the v1 patchset this register was left uninitialized and the
garbage data in it could cause deadlocks depeding on luck.
* Addressed review comments from Bernhard Nortmann and Peter Korsgaard
(a typo fix, more comments, redundant aw_fel_write call removal)
* Introduced a new patch with helper functions for reading/writing ARM
coprocessor registers

Siarhei Siamashka (3):
fel: Helper functions for reading/writing ARM coprocessor registers
fel: Support for enabling MMU after running SPL on new SoC variants
fel: Enable MMU on Allwinner-H3 to improve FEL transfer speed

fel.c | 191 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 167 insertions(+), 24 deletions(-)

--
2.4.10

Siarhei Siamashka

unread,
Nov 16, 2015, 8:29:45 PM11/16/15
to linux...@googlegroups.com, Bernhard Nortmann, Jens Kuske, Vishnu Patekar, Peter Korsgaard
This helps to reduce code duplication if we want to read and write
many different kinds of coprocessor registers.

Signed-off-by: Siarhei Siamashka <siarhei....@gmail.com>
---
fel.c | 70 ++++++++++++++++++++++++++++++++++++++++++++-----------------------
1 file changed, 46 insertions(+), 24 deletions(-)

diff --git a/fel.c b/fel.c
index e9f6450..f394916 100644
--- a/fel.c
+++ b/fel.c
@@ -533,6 +533,50 @@ static uint32_t fel_to_spl_thunk[] = {
#define DRAM_BASE 0x40000000
#define DRAM_SIZE 0x80000000

+uint32_t aw_read_arm_cp_reg(libusb_device_handle *usb, soc_sram_info *sram_info,
+ uint32_t coproc, uint32_t opc1, uint32_t crn,
+ uint32_t crm, uint32_t opc2)
+{
+ uint32_t val = 0;
+ uint32_t opcode = 0xEE000000 | (1 << 20) | (1 << 4) |
+ ((opc1 & 7) << 21) |
+ ((crn & 15) << 16) |
+ ((coproc & 15) << 8) |
+ ((opc2 & 7) << 5) |
+ (crm & 15);
+ uint32_t arm_code[] = {
+ htole32(opcode), /* mrc coproc, opc1, r0, crn, crm, opc2 */
+ htole32(0xe58f0000), /* str r0, [pc] */
+ htole32(0xe12fff1e), /* bx lr */
+ };
+ aw_fel_write(usb, arm_code, sram_info->scratch_addr, sizeof(arm_code));
+ aw_fel_execute(usb, sram_info->scratch_addr);
+ aw_fel_read(usb, sram_info->scratch_addr + 12, &val, sizeof(val));
+ return le32toh(val);
+}
+
+void aw_write_arm_cp_reg(libusb_device_handle *usb, soc_sram_info *sram_info,
+ uint32_t coproc, uint32_t opc1, uint32_t crn,
+ uint32_t crm, uint32_t opc2, uint32_t val)
+{
+ uint32_t opcode = 0xEE000000 | (0 << 20) | (1 << 4) |
+ ((opc1 & 7) << 21) |
+ ((crn & 15) << 16) |
+ ((coproc & 15) << 8) |
+ ((opc2 & 7) << 5) |
+ (crm & 15);
+ uint32_t arm_code[] = {
+ htole32(0xe59f000c), /* ldr r0, [pc, #12] */
+ htole32(opcode), /* mcr coproc, opc1, r0, crn, crm, opc2 */
+ htole32(0xf57ff04f), /* dsb sy */
+ htole32(0xf57ff06f), /* isb sy */
+ htole32(0xe12fff1e), /* bx lr */
+ htole32(val)
+ };
+ aw_fel_write(usb, arm_code, sram_info->scratch_addr, sizeof(arm_code));
+ aw_fel_execute(usb, sram_info->scratch_addr);
+}
+
void aw_enable_l2_cache(libusb_device_handle *usb, soc_sram_info *sram_info)
{
uint32_t arm_code[] = {
@@ -586,34 +630,12 @@ void aw_get_stackinfo(libusb_device_handle *usb, soc_sram_info *sram_info,

uint32_t aw_get_ttbr0(libusb_device_handle *usb, soc_sram_info *sram_info)
{
- uint32_t ttbr0 = 0;
- uint32_t arm_code[] = {
- htole32(0xee122f10), /* mrc 15, 0, r2, cr2, cr0, {0} */
- htole32(0xe58f2008), /* str r2, [pc, #8] */
- htole32(0xe12fff1e), /* bx lr */
- };
-
- aw_fel_write(usb, arm_code, sram_info->scratch_addr, sizeof(arm_code));
- aw_fel_execute(usb, sram_info->scratch_addr);
- aw_fel_read(usb, sram_info->scratch_addr + 0x14, &ttbr0, sizeof(ttbr0));
- ttbr0 = le32toh(ttbr0);
- return ttbr0;
+ return aw_read_arm_cp_reg(usb, sram_info, 15, 0, 2, 0, 0);
}

uint32_t aw_get_sctlr(libusb_device_handle *usb, soc_sram_info *sram_info)
{
- uint32_t sctlr = 0;
- uint32_t arm_code[] = {
- htole32(0xee112f10), /* mrc 15, 0, r2, cr1, cr0, {0} */
- htole32(0xe58f2008), /* str r2, [pc, #8] */
- htole32(0xe12fff1e), /* bx lr */
- };
-
- aw_fel_write(usb, arm_code, sram_info->scratch_addr, sizeof(arm_code));
- aw_fel_execute(usb, sram_info->scratch_addr);
- aw_fel_read(usb, sram_info->scratch_addr + 0x14, &sctlr, sizeof(sctlr));
- sctlr = le32toh(sctlr);
- return sctlr;
+ return aw_read_arm_cp_reg(usb, sram_info, 15, 0, 1, 0, 0);
}

uint32_t *aw_backup_and_disable_mmu(libusb_device_handle *usb,
--
2.4.10

Siarhei Siamashka

unread,
Nov 16, 2015, 8:29:47 PM11/16/15
to linux...@googlegroups.com, Bernhard Nortmann, Jens Kuske, Vishnu Patekar, Peter Korsgaard
The BROM in newer SoC variants doesn't enable MMU by default anymore.
So in order to benefit from e4b3da2b17ee9d7c5cab9b80e708b3a309fc4c96
("fel: Faster USB transfers via 'fel write' to DRAM"), we need to
be able to enable it from the 'sunxi-fel' tool.

This patch can be interpreted as simply reverting the changes done
by Allwinner and bringing back the MMU support in roughly the same
way as it was before. That's why the values in the hardware
registers and the translation table entries replicate the A20 setup.

Additionally, the code is now more defensive and introduces new
"canary" checks for certain known magic values in the coprocessor
registers in order to safeguard against any unpleasant surprises.

MMU tuning for A80 and A64 will probably need a more sophisticated
setup with a second level page table. Because both the SRAM and
the BROM reside in the same 1MB section there and we need finer
granularity. In other words, enabling the MMU on A80 and A64 is
not supported yet.

Signed-off-by: Siarhei Siamashka <siarhei....@gmail.com>
---
fel.c | 124 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 122 insertions(+), 2 deletions(-)

diff --git a/fel.c b/fel.c
index f394916..9a08af6 100644
--- a/fel.c
+++ b/fel.c
@@ -388,6 +388,18 @@ typedef struct {
* Note: the entries in the 'swap_buffers' tables need to be sorted by 'buf1'
* addresses. And the 'buf1' addresses are the BROM data buffers, while 'buf2'
* addresses are the intended backup locations.
+ *
+ * Also for performance reasons, we optionally want to have MMU enabled with
+ * optimal section attributes configured (the code from the BROM should use
+ * I-cache, writing data to the DRAM area should use write combining). The
+ * reason is that the BROM FEL protocol implementation moves data using the
+ * CPU somewhere on the performance critical path when transferring data over
+ * USB. The older SoC variants (A10/A13/A20/A31/A23) already have MMU enabled
+ * and we only need to adjust section attributes. The BROM in newer SoC variants
+ * (A33/A83T/H3) doesn't enable MMU anymore, so we need to find some 16K of
+ * spare space in SRAM to place the translation table there and specify it as
+ * the 'mmu_tt_addr' field in the 'soc_sram_info' structure. The 'mmu_tt_addr'
+ * address must be 16K aligned.
*/
typedef struct {
uint32_t soc_id; /* ID of the SoC */
@@ -396,6 +408,7 @@ typedef struct {
uint32_t thunk_addr; /* Address of the thunk code */
uint32_t thunk_size; /* Maximal size of the thunk code */
uint32_t needs_l2en; /* Set the L2EN bit */
+ uint32_t mmu_tt_addr; /* MMU translation table address */
sram_swap_buffers *swap_buffers;
} soc_sram_info;

@@ -633,17 +646,80 @@ uint32_t aw_get_ttbr0(libusb_device_handle *usb, soc_sram_info *sram_info)
return aw_read_arm_cp_reg(usb, sram_info, 15, 0, 2, 0, 0);
}

+uint32_t aw_get_ttbcr(libusb_device_handle *usb, soc_sram_info *sram_info)
+{
+ return aw_read_arm_cp_reg(usb, sram_info, 15, 0, 2, 0, 2);
+}
+
+uint32_t aw_get_dacr(libusb_device_handle *usb, soc_sram_info *sram_info)
+{
+ return aw_read_arm_cp_reg(usb, sram_info, 15, 0, 3, 0, 0);
+}
+
uint32_t aw_get_sctlr(libusb_device_handle *usb, soc_sram_info *sram_info)
{
return aw_read_arm_cp_reg(usb, sram_info, 15, 0, 1, 0, 0);
}

+void aw_set_ttbr0(libusb_device_handle *usb, soc_sram_info *sram_info,
+ uint32_t ttbr0)
+{
+ return aw_write_arm_cp_reg(usb, sram_info, 15, 0, 2, 0, 0, ttbr0);
+}
+
+void aw_set_ttbcr(libusb_device_handle *usb, soc_sram_info *sram_info,
+ uint32_t ttbcr)
+{
+ return aw_write_arm_cp_reg(usb, sram_info, 15, 0, 2, 0, 2, ttbcr);
+}
+
+void aw_set_dacr(libusb_device_handle *usb, soc_sram_info *sram_info,
+ uint32_t dacr)
+{
+ aw_write_arm_cp_reg(usb, sram_info, 15, 0, 3, 0, 0, dacr);
+}
+
+void aw_set_sctlr(libusb_device_handle *usb, soc_sram_info *sram_info,
+ uint32_t sctlr)
+{
+ aw_write_arm_cp_reg(usb, sram_info, 15, 0, 1, 0, 0, sctlr);
+}
+
+/*
+ * Reconstruct the same MMU translation table as used by the A20 BROM.
+ * We are basically reverting the changes, introduced in newer SoC
+ * variants. This works fine for the SoC variants with the memory
+ * layout similar to A20 (the SRAM is in the first megabyte of the
+ * address space and the BROM is in the last megabyte of the address
+ * space).
+ */
+uint32_t *aw_generate_mmu_translation_table(void)
+{
+ uint32_t *tt = malloc(4096 * sizeof(uint32_t));
+ uint32_t i;
+
+ /*
+ * Direct mapping using 1MB sections with TEXCB=00000 (Strongly
+ * ordered) for all memory except the first and the last sections,
+ * which have TEXCB=00100 (Normal). Domain bits are set to 1111
+ * and AP bits are set to 11, but this is mostly irrelevant.
+ */
+ for (i = 0; i < 4096; i++)
+ tt[i] = 0x00000DE2 | (i << 20);
+ tt[0x000] |= 0x1000;
+ tt[0xFFF] |= 0x1000;
+
+ return tt;
+}
+
uint32_t *aw_backup_and_disable_mmu(libusb_device_handle *usb,
soc_sram_info *sram_info)
{
uint32_t *tt = NULL;
uint32_t ttbr0 = aw_get_ttbr0(usb, sram_info);
uint32_t sctlr = aw_get_sctlr(usb, sram_info);
+ uint32_t ttbcr = aw_get_ttbcr(usb, sram_info);
+ uint32_t dacr = aw_get_dacr(usb, sram_info);
uint32_t i;

uint32_t arm_code[] = {
@@ -657,13 +733,35 @@ uint32_t *aw_backup_and_disable_mmu(libusb_device_handle *usb,
htole32(0xe12fff1e), /* bx lr */
};

+ /*
+ * Below are some checks for the register values, which are known
+ * to be initialized in this particular way by the existing BROM
+ * implementations. We don't strictly need them to exactly match,
+ * but still have these safety guards in place in order to detect
+ * and review any potential configuration changes in future SoC
+ * variants (if one of these checks fails, then it is not a serious
+ * problem but more likely just an indication that one of these
+ * checks needs to be relaxed).
+ */
+
+ /* Basically, ignore M/Z/I bits and expect no TEX remap */
+ if ((sctlr & ~((1 << 12) | (1 << 11) | 1)) != 0x00C52078) {
+ fprintf(stderr, "Unexpected SCTLR (%08X)\n", sctlr);
+ exit(1);
+ }
+
if (!(sctlr & 1)) {
pr_info("MMU is not enabled by BROM\n");
return NULL;
}

- if ((sctlr >> 28) & 1) {
- fprintf(stderr, "TEX remap is enabled!\n");
+ if (dacr != 0x55555555) {
+ fprintf(stderr, "Unexpected DACR (%08X)\n", dacr);
+ exit(1);
+ }
+
+ if (ttbcr != 0x00000000) {
+ fprintf(stderr, "Unexpected TTBCR (%08X)\n", ttbcr);
exit(1);
}

@@ -808,6 +906,28 @@ void aw_fel_write_and_execute_spl(libusb_device_handle *usb,
pr_info("Stack pointers: sp_irq=0x%08X, sp=0x%08X\n", sp_irq, sp);

tt = aw_backup_and_disable_mmu(usb, sram_info);
+ if (!tt && sram_info->mmu_tt_addr) {
+ if (sram_info->mmu_tt_addr & 0x3FFF) {
+ fprintf(stderr, "SPL: 'mmu_tt_addr' must be 16K aligned\n");
+ exit(1);
+ }
+ pr_info("Generating the new MMU translation table at 0x%08X\n",
+ sram_info->mmu_tt_addr);
+ /*
+ * These settings are used by the BROM in A10/A13/A20 and
+ * we replicate them here when enabling the MMU. The DACR
+ * value 0x55555555 means that accesses are checked against
+ * the permission bits in the translation tables for all
+ * domains. The TTBCR value 0x00000000 means that the short
+ * descriptor translation table format is used, TTBR0 is used
+ * for all the possible virtual addresses (N=0) and that the
+ * translation table must be aligned at a 16K boundary.
+ */
+ aw_set_dacr(usb, sram_info, 0x55555555);
+ aw_set_ttbcr(usb, sram_info, 0x00000000);
+ aw_set_ttbr0(usb, sram_info, sram_info->mmu_tt_addr);
+ tt = aw_generate_mmu_translation_table();

Siarhei Siamashka

unread,
Nov 16, 2015, 8:29:48 PM11/16/15
to linux...@googlegroups.com, Bernhard Nortmann, Jens Kuske, Vishnu Patekar, Peter Korsgaard
When the CPU clock speed is set to 480 MHz by the U-Boot SPL,
the performance of 'sunxi-fel write' transfers to DRAM improves
from ~95 KB/s to ~510 KB/s.

When the CPU clock speed is set to 1008 MHz by the U-Boot SPL,
the performance of 'sunxi-fel write' transfers to DRAM improves
from ~180 KB/s to ~510 KB/s.

This means that the CPU is not a bottleneck for FEL transfers
anymore. Further performance improvements are possible by
increasing the AHB1 clock speed in the U-Boot SPL (up to
something like ~900 KB/s).

Signed-off-by: Siarhei Siamashka <siarhei....@gmail.com>
---
fel.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fel.c b/fel.c
index 9a08af6..1544d18 100644
--- a/fel.c
+++ b/fel.c

Peter Korsgaard

unread,
Nov 23, 2015, 2:19:46 AM11/23/15
to Siarhei Siamashka, linux...@googlegroups.com, Bernhard Nortmann, Jens Kuske, Vishnu Patekar
>>>>> "Siarhei" == Siarhei Siamashka <siarhei....@gmail.com> writes:

> Changes in v2:
> * The DACR register is now properly initialized when enabling the MMU.
> In the v1 patchset this register was left uninitialized and the
> garbage data in it could cause deadlocks depeding on luck.
> * Addressed review comments from Bernhard Nortmann and Peter Korsgaard
> (a typo fix, more comments, redundant aw_fel_write call removal)
> * Introduced a new patch with helper functions for reading/writing ARM
> coprocessor registers

> Siarhei Siamashka (3):
> fel: Helper functions for reading/writing ARM coprocessor registers
> fel: Support for enabling MMU after running SPL on new SoC variants
> fel: Enable MMU on Allwinner-H3 to improve FEL transfer speed

The series looks good to me - Thanks for reworking it!

Acked-by: Peter Korsgaard <pe...@korsgaard.com>

--
Venlig hilsen,
Peter Korsgaard
Reply all
Reply to author
Forward
0 new messages