emBLOD is OK but running eLua in SDRAM is slow


Martin Guy

Mar 14, 2011, 2:18:33 AM
to miz...@googlegroups.com
Hi
I've submitted a patch to elua-dev to enable emBLOD loading, but it
turns out that running code in SDRAM is much slower than running it in
onboard flash.

The figures I have for two test programs say that it is six times
slower for a recursive fibonacci function and nine times slower for a
simple "for" loop.
It seems to be a running-in-SDRAM issue, not a clock setup issue,
since real-time timers still give the correct delays. There are
reports on avrfreaks of SDRAM access being 3 times slower than
internal RAM for a memcpy(), due to the SDRAM being on a 16-bit data
path and on the other side of the High Speed Bus. Running code, we
incur extra delays because burst mode is not used and, with code and
data both being in SDRAM, it is fairly unlikely that two successive
accesses to SDRAM will be in the same row, which incurs a further row
setup delay for every memory access.
This reduces the performance of the 66MHz AVR32 to the equivalent
of a 7 to 11 megahertz processor.
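
The 7 to 11 MHz range is simply the 66 MHz clock divided by the two measured slowdown factors. A quick check in plain Lua (the function and table names are only illustrative, nothing here is eLua-specific):

```lua
-- Effective clock = real clock / measured slowdown factor.
local clock_mhz = 66
local slowdown = { fib = 6, for_loop = 9 }  -- factors quoted above

function effective_mhz(factor)
  return clock_mhz / factor
end

print(string.format("fib:      %.1f MHz", effective_mhz(slowdown.fib)))
print(string.format("for loop: %.1f MHz", effective_mhz(slowdown.for_loop)))
```

This gives 11.0 MHz for the recursive Fibonacci case and about 7.3 MHz for the "for" loop.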

With all modules and floating point support enabled, the image size
increases to 180KB. Some more space will be taken by the net support
(and adc, pwm, i2c modules), but in any case, using floating point
numbers slows the for loop down by a further factor of three, so it
may be best to stick with integer Lua for Mizar32.
Other options for a fast eLua are to increase the Flash size in the
AT32 part to 256KB, or to keep the interpreter within 120KB; both are
possible.

M

Bogdan Marinescu

Mar 14, 2011, 3:46:39 AM
to miz...@googlegroups.com
Hi,

Both are possible, yes, but I personally doubt the performance figures
will change significantly. Lua is a language in which all types (even
numbers) are kept internally in a dynamically allocated C structure.
If this C structure lands in SDRAM, the performance problem will still
be very obvious. I tested this a while ago and while I don't have the
exact test results anymore, it was clear that eLua (running from
internal Flash) with data in SDRAM was a few times (3-4 I think, I
don't remember precisely) slower than eLua with data in the internal
MPU RAM. Not much to be done here, I'm afraid.

Best,
Bogdan

James Snyder

Mar 14, 2011, 11:43:00 AM
to miz...@googlegroups.com, Martin Guy

It's a bit of a shame that it's up at 180 KB. I know that Bogdan
was working on loadable modules on some of the other platforms for
eLua; perhaps that could be useful? Also, at least on 32-bit ARM, we
save a bit with the toolchain instructions suggested on the site by
building newlib with some custom options that prefer size over speed.
I wonder if this might help for AVR32 as well. I was able to get an
AVR32 toolchain built using crosstool-NG but I didn't dig around to
see what build flags it was using. I forget what the size savings end
up being, but I think it could have been 10 KB or more, if I recall
correctly.

Also, as far as integer vs float, while this might make the size issue
slightly worse, we have been considering the LNUM patch for quite some
time now, which is designed to allow higher precision integer
operations and/or improve performance on platforms that lack hardware
FP: http://luaforge.net/frs/?group_id=214&release_id=1341

It doesn't currently apply cleanly, and I'm not sure if it might have
any big-endian issues, but it might at least save compute time when
one is doing integer operations, while allowing for the flexibility of
FP operations when the user would like to have them. If I recall
correctly what the patch will do is check if a number can be
represented as an integer and do the op as an integer op instead of a
floating point op, otherwise it does it as a float op. Perhaps there
are even some ways to use the DSP instructions built-in to the AVR32
in appropriate conditions, but I'm not sure if that's worthwhile or
not?
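
To make the integer-first dispatch described above concrete, here is a small simulation of the idea in plain Lua. It is only a sketch of the behaviour as recalled in this message, not code from the LNUM patch (which works inside the interpreter's C core); `is_int32` and `add` are hypothetical names, and the 32-bit range is an assumption.

```lua
-- True when x has no fractional part and fits in a signed 32-bit integer.
function is_int32(x)
  return x == math.floor(x) and x >= -2147483648 and x <= 2147483647
end

-- Add two numbers and report which path would be taken:
-- the cheap integer path when operands and result are integral,
-- otherwise the floating point path.
function add(a, b)
  local sum = a + b
  if is_int32(a) and is_int32(b) and is_int32(sum) then
    return sum, "integer"
  end
  return sum, "float"
end

print(add(2, 3))            -- takes the integer path
print(add(0.5, 3))          -- takes the float path
print(add(2147483647, 1))   -- result overflows 32 bits, so falls back to float
```

The point of the design is that a counting loop or integer arithmetic never touches the software floating point routines, while fractional values still work when the program needs them.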



--
James Snyder
Biomedical Engineering
Northwestern University
jbsn...@gmail.com
PGP: http://fanplastic.org/key.txt
Phone: (847) 448-0386

Sergio Sorrenti

Mar 14, 2011, 11:43:44 AM
to miz...@googlegroups.com
Is there a third option?
Could we run a kernel of eLua in flash and load modules externally?
How far off is this possibility?

Sergio

2011/3/14 Bogdan Marinescu <bogdan.m...@gmail.com>

James Snyder

Mar 14, 2011, 9:12:08 PM
to miz...@googlegroups.com, Sergio Sorrenti
On Mon, Mar 14, 2011 at 10:43 AM, Sergio Sorrenti
<sergio....@gmail.com> wrote:
> There is a Third option?
> Run a Kernel of eLua in Flash and load modules externally?
> How far is this possibility?

I think only Bogdan can speak to this possibility right now as far as
what platforms he has tried it on and how well it has worked, but it
is something that is on the roadmap:
http://www.eluaproject.net/en_status.html#roadmap

Essentially you would have to place binary modules on one of the
readable filesystems (such as the SD card) and eLua would be able to
load them into RAM at runtime. I suppose this would have some hybrid
advantages over the bootloader purely loading into SDRAM since some of
eLua could be in SRAM (the core interpreter, for example). What I'm
not sure of, however, is how large the modules are. Peripheral modules
for eLua are generally not that many kB individually, but they might
make this worthwhile.

I haven't looked at emBLOD, but perhaps there's even some sort of
half-way approach we could take with it to have it load some of the
image into flash (core of Lua) and less accessed components into SDRAM
with the map being generated at compile time? I'm not very familiar
with the bootloader or the memory map/management on AVR32 so I'm not
sure what's involved, but that might be a quicker way to test out a
hybrid load?

One could probably even do a little profiling somehow to try to
determine what components are best suited to being loaded into onboard
flash vs SDRAM.

As far as the different parts go, it looks like in quantities of ~1000
at DigiKey a 128 kB -> 256 kB part trade costs ~USD$1.60 more per
unit.

I've still mostly been compiling for the STK1100 which has plenty of
flash, and I've also mostly worked with 256k+ ARM targets, so I
haven't really experimented with squeezing images into 128kB, but I
expect you've become experts at this by now :-)

Any thoughts Bogdan?


Martin Guy

Mar 15, 2011, 12:39:22 AM
to miz...@googlegroups.com, Bogdan Marinescu
On 14 March 2011 08:46, Bogdan Marinescu <bogdan.m...@gmail.com> wrote:
> Lua is a language in which all types (even
> numbers) are kept internally in a dynamically allocated C structure.
> If this C structure lands in SDRAM, the performance problem will still
> be very obvious. I tested this a while ago and while I don't have the
> exact tests results anymore it was clear that eLua (running from
> internal Flash) with data in SDRAM was a few times (3-4 I think, I
> don't remember precisely) slower than eLua with data in the internal
> MPU RAM. Not much to be done here, I'm afraid.

On the contrary, it suggests making careful use of the internal SRAM.
The current Mizar setup puts initialised and uninitialised data and
the stack in SRAM, while the heap lives entirely in SDRAM.

Here are test results having the heap in SDRAM or in SRAM:

Tests:
1) flashing a LED using "for i = 1,1000000 do end" as the delay
2) function fib(n)
       if n <= 2 then return 1 end
       return( fib(n-1) + fib(n-2) )
end
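
A rough way to time both tests from within Lua is sketched below. It assumes os.clock() is available and has usable resolution on the target, which may not hold on every eLua build (a hardware timer would likely be more accurate there); `time_it` and `empty_loop` are illustrative names.

```lua
-- Time a function with os.clock() (CPU seconds); returns elapsed time and result.
function time_it(f, ...)
  local t0 = os.clock()
  local result = f(...)
  return os.clock() - t0, result
end

-- Test 1: the empty counting loop used as the LED delay.
function empty_loop(n)
  for i = 1, n do end
end

-- Test 2: naive recursive Fibonacci.
function fib(n)
  if n <= 2 then return 1 end
  return fib(n - 1) + fib(n - 2)
end

local loop_secs = time_it(empty_loop, 1000000)
local fib_secs, fib_value = time_it(fib, 25)
print(string.format("loop: %.2f s", loop_secs))
print(string.format("fib(25) = %d in %.2f s", fib_value, fib_secs))
```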

Interpreter in flash, heap in SDRAM: 10 flashes take 17.4 secs,
print(fib(25)) takes 4.46 secs
Interpreter in flash, heap in SRAM: 10 flashes take 10.4 secs,
print(fib(25)) takes 2.20 secs.
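
Expressed as ratios, these timings put the heap-in-SDRAM penalty at roughly 1.7x for the loop and 2.0x for fib. The arithmetic, as a small Lua sketch (the table layout and `slowdown` name are only for illustration):

```lua
-- Measured times in seconds, copied from the results above.
local results = {
  { test = "10 LED flashes", sdram = 17.4, sram = 10.4 },
  { test = "print(fib(25))", sdram = 4.46, sram = 2.20 },
}

-- How many times slower the SDRAM-heap run is than the SRAM-heap run.
function slowdown(sdram_secs, sram_secs)
  return sdram_secs / sram_secs
end

for _, r in ipairs(results) do
  print(string.format("%-15s heap in SDRAM is %.2fx slower",
                      r.test, slowdown(r.sdram, r.sram)))
end
```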

If we enable both, I would have thought that the same code fragments
would run at different speeds according to the amount of heap in use,
but testing this mixed heap with the newlib allocator, it runs these tiny
example programs at SRAM speed, even if you allocate a few dozen K of
rubbish into a table before loading them.
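
The "allocate rubbish first" step can be reproduced along these lines. This is a sketch, not the exact script used for the test above: `allocate_rubbish` is a hypothetical helper and the 48 KB figure is just an example of "a few dozen K".

```lua
-- Fill a table with roughly `kbytes` KB of distinct strings.
function allocate_rubbish(kbytes)
  local junk = {}
  for i = 1, kbytes do
    -- Each string is 1024 bytes and unique, so Lua cannot share one copy.
    junk[i] = string.rep("x", 1016) .. string.format("%08d", i)
  end
  return junk
end

collectgarbage("collect")
local before = collectgarbage("count")   -- heap in use, in KB
local junk = allocate_rubbish(48)
local after = collectgarbage("count")
print(string.format("heap grew by about %.0f KB", after - before))
-- Keep `junk` referenced while the benchmarks run so it is not collected.
```

Running the two test programs after this point shows whether their speed changes once part of the heap has been filled.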

I've pushed some patches into git to make these changes on
AVR32/Mizar32, also reducing the stack size from 8192 to 2048, the
same as the other small platforms.
In practice, even reducing the stack to 256 bytes still seems to work
OK (128 doesn't). How do we get figures for eLua stack usage high
water mark? Does it depend on Lua program recursiveness?

Incidentally, can anyone see where I'm missing some RAM usage?
collectgarbage("count") reports a maximum of 16K of heap being used
out of the 32KB SRAM.
The eLua code's static data (data + bss) uses about 3K in total:
   text    data     bss     dec     hex filename
 119808    1364    1480  122652   1df1c elua_lualong_at32uc3a0128.elf
and I've set a 2K stack, the same as the other small platforms. Where
have the other 11KB gone?
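
For reference, the accounting behind the 11KB figure, using only the numbers quoted above (the `unaccounted` helper is just for illustration):

```lua
-- All sizes in bytes; figures are the ones quoted above.
local sram_total = 32 * 1024
local heap_peak  = 16 * 1024      -- maximum reported by collectgarbage("count")
local data, bss  = 1364, 1480     -- from the size output
local stack      = 2 * 1024

-- Subtract every known consumer from the total.
function unaccounted(total, ...)
  local used = 0
  for _, n in ipairs({...}) do used = used + n end
  return total - used
end

local missing = unaccounted(sram_total, heap_peak, data, bss, stack)
print(string.format("static data+bss: %d bytes", data + bss))
print(string.format("unaccounted: %d bytes (%.1f KB)", missing, missing / 1024))
```

Static data comes to 2844 bytes, which leaves 11492 bytes (about 11.2 KB) with no known owner.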

M

Marcus Jansson

Mar 15, 2011, 5:06:10 AM
to Mizar32


On Mar 15, 2:12 am, James Snyder <jbsny...@fanplastic.org> wrote:
> On Mon, Mar 14, 2011 at 10:43 AM, Sergio Sorrenti
>
> <sergio.sorre...@gmail.com> wrote:
> > There is a Third option?
> > Run a Kernel of eLua in Flash and load modules externally?
> > How far is this possibility?
>
> I haven't looked at emBLOD, but perhaps there's even some sort of
> half-way approach we could take with it to have it load some of the
> image into flash (core of Lua) and less accessed components into SDRAM
> with the map being generated at compile time? I'm not very familiar
> with the bootloader or the memory map/management on AVR32 so I'm not
> sure what's involved, but that might be a quicker way to test out a
> hybrid load?

My plan for emBLOD is to allow loading several different files to
external SDRAM and/or internal SRAM during boot. It is not yet
implemented, but I will get started on it right away. Essentially, when
done, it should be possible to have a bootparm.txt on the SD card like
this example, which loads three files at different locations and starts
eLua from internal flash:

bootfile = adc.module pwm.module other.bin
loadaddr = INTERNAL_SRAM EXTERNAL_SDRAM1 EXTERNAL_SDRAM2
bootaddr = ELUA_AT_INTERNAL_FLASH_0x80002XYZ


Today it is only possible to load one file and start from internal
flash:

bootfile = all_external_modules.bin
#Load to SDRAM
loadaddr = 0xd0000000
#Boot from flash
bootaddr = 0x80002xyz

Compilation of the eLua modules and getting the load addresses
automatically is the difficult part. :)
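
To illustrate the file format only, here is a small Lua sketch that reads "key = value" lines with '#' comments and splits multi-file values on whitespace. It is not emBLOD's parser (emBLOD is a separate bootloader); `parse_bootparm` is a hypothetical name and the splitting rule is an assumption based on the examples above.

```lua
-- Parse "key = value" lines; a line starting with '#' is a comment.
-- Each value is split on whitespace into a list.
function parse_bootparm(text)
  local params = {}
  for line in text:gmatch("[^\r\n]+") do
    if not line:match("^%s*#") then
      local key, value = line:match("^%s*([%w_]+)%s*=%s*(.-)%s*$")
      if key then
        local items = {}
        for item in value:gmatch("%S+") do items[#items + 1] = item end
        params[key] = items
      end
    end
  end
  return params
end

local example = [[
bootfile = all_external_modules.bin
#Load to SDRAM
loadaddr = 0xd0000000
#Boot from flash
bootaddr = 0x80002xyz
]]

local p = parse_bootparm(example)
print(p.bootfile[1], p.loadaddr[1], p.bootaddr[1])
```

With the planned multi-file form, `bootfile` and `loadaddr` would each yield a list of the same length, pairing every file with its load address.
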
Regards,
Marcus Jansson

Bogdan Marinescu

Mar 15, 2011, 5:11:26 AM
to Martin Guy, miz...@googlegroups.com
On Tue, Mar 15, 2011 at 6:39 AM, Martin Guy <marti...@gmail.com> wrote:
> On the contrary, it suggests making careful use of the internal SRAM.
> The current Mizar setup puts initialised and uninitialized data and
> the stack in SRAM, while only the SDRAM is used for all the heap.

I understand, but this is only a "local fix". Sooner or later you're
going to end up in the SDRAM anyway. And it's going to be sooner for
things that are not trivial.

>
> Here are test results having the heap in SDRAM or in SRAM:
>
> Tests:
> 1) flashing a LED using "for i = 1,1000000 do end" as the delay
> 2) function fib(n)
>        if n <= 2 then return 1 end
>        return( fib(n-1) + fib(n-2) )
>   end
>
> Interpreter in flash, heap in SDRAM: 10 flashes take 17.4 secs,
> print(fib(25)) takes 4.46 secs
> Interpreter in flash, heap in SRAM: 10 flashes take 10.4 secs,
> print(fib(25)) takes 2.20 secs.
>
> If we enable both, I would have thought that the same code fragments
> would run at different speeds according to the amount of heap in use,
> but testing this mixed heap with newlib allocator, it runs these tiny
> example programs at SRAM speed, even if you allocate a few dozen K of
> rubbish into a table before loading them.
>
> I've pushed some patches into git to make these changes on
> AVR32/Mizar32, also reducing the stack size from 8192 to 2048, the
> same as the other small platforms.
> In practice, even reducing the stack to 256 bytes still seems to work
> OK  (128 doesn't).  How do we get figures for eLua stack usage high
> water mark?    Does it depend on Lua program recursiveness?

Yes it does. It's mostly related to the Lua parser. 256 is WAY too
little: even if you don't involve the Lua parser, you're likely to get
stack overflows because of automatic variables alone. I'd say that
anything less than 4k is asking for trouble.

> Incidentally, can anyone see where I'm missing some RAM usage?
> collectgarbage("count") reports a maximum of 16K of heap being used
> out of the 32KB SRAM.
> The eLua code uses 3K total:
>   text    data     bss     dec     hex filename
>  119808    1364    1480  122652   1df1c elua_lualong_at32uc3a0128.elf
> and I've set a 2K stack, the same as the other small platforms.  Where
> have the other 11KB gone?

I never really understood that part myself ... Sure, eLua does some
dynamic allocations of its own (buffers, for example) but the
difference is still too high.

Best,
Bogdan

Bogdan Marinescu

Mar 15, 2011, 5:12:26 AM
to miz...@googlegroups.com
On Mon, Mar 14, 2011 at 5:43 PM, Sergio Sorrenti
<sergio....@gmail.com> wrote:
> There is a Third option?
> Run a Kernel of eLua in Flash and load modules externally?
> How far is this possibility?

Quite far at the moment, I'm afraid.
