Python on Altera Nios CPU


Vern Muhr

Sep 20, 2011, 7:11:25 PM
to python-on-a-chip
Hi All,

I got PyMite running on an Altera FPGA this morning. It compiled
without trouble in the Eclipse Nios Software Build Tool environment.
Most of the work was tweaking the makefiles, since the Nios SBT build
environment is not organized like the PyMite environment. When I get
things cleaned up, I would be happy to post it.

There are some interesting things that could be done with the Nios/
PyMite combination. For example, the Nios CPU can be configured with
up to 256 custom instructions implemented in configurable FPGA
hardware. It might be possible to use hardware to assist the VM and
speed up execution of Python code.

I couldn't find much documentation on the "stackless green threads".
Are there any examples? I use the Stackless version of Python for all
my Python work. I have also used greenlets.

Best regards, Vern

Gmail - neonmark

Sep 20, 2011, 8:20:36 PM
to python-o...@googlegroups.com
This sounds very interesting to me. I am thinking of the Lisp machine,
which executed bytecodes in microcode and used a tagged 40-bit architecture.
The tagged architecture used the extra bits to identify types, so the
CPU dispatched 8-bit operands differently from 32-bit floats.
It also meant the bytecodes remained as they were.
Extra bits in the tag were used for ephemeral garbage collection. Doing
this in hardware let the Symbolics machine stay up for years as
a developer machine.

It sounds like the Altera FPGA environment might facilitate this. Anyone
interested?

Specifically, in making a system that executes Python bytecode natively.

Mark (ex Symbolics but not in h/w division)

Dean Hall

Sep 20, 2011, 9:40:20 PM
to python-o...@googlegroups.com
I thought I had a doc on threads, but I see that I don't.

To see an example of a p14p program running two threads, look in the
system tests. src/tests/system/t075.c is the main() that runs two
threads. In this case, the callables that are run as threads are the
modules t075a.py and t075b.py.

$ cd src/tests/system
$ make t075.out
../../tools/pmImgCreator.py -f ../../platform/desktop/pmfeatures.py -c
-u -o t075_img.c --native-file=t075_nat.c t075a.py t075b.py
cc -Os -fno-strict-aliasing -Wall -Wstrict-prototypes -Werror -I../../
vm -I../../platform/desktop -lm -o t075.out t075_nat.c t075_img.c
t075.c ../../platform/desktop/plat.o ../../vm/libpmvm_desktop.a
./t075.out
Thread B
-33417
Thread A
-99

The scheduler is simple round-robin. In order for the VM to be pre-
empted, the system tick must be operational, which means that the
platform interface must periodically call pm_vmPeriodic(), giving the
period (in microseconds) as the argument. The common pattern is to
create a periodic timer interrupt on a microcontroller and call
pm_vmPeriodic() from within the ISR.
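The round-robin model above can be sketched in miniature with Python generators. This is only an illustration of the scheduling idea, not the p14p implementation: the real VM is written in C, and pre-emption is driven by the pm_vmPeriodic() tick described above.

```python
# A minimal round-robin "green thread" scheduler sketched with Python
# generators. Each yield stands in for the time-slice boundary where
# the real VM would switch threads on the system tick.

def thread_a():
    for i in range(3):
        yield f"A{i}"

def thread_b():
    for i in range(3):
        yield f"B{i}"

def round_robin(threads):
    """Run each thread for one slice in turn until all have finished."""
    out = []
    ready = list(threads)
    while ready:
        t = ready.pop(0)
        try:
            out.append(next(t))
            ready.append(t)      # re-queue at the back: round-robin
        except StopIteration:
            pass                 # thread finished; drop it
    return out

print(round_robin([thread_a(), thread_b()]))
# → ['A0', 'B0', 'A1', 'B1', 'A2', 'B2']
```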

!!Dean
PS: Great job on the port to Nios. I know a handful of people have
asked if I've ported to Nios. This is the first port to Nios that
I've heard of.

> --
> You are subscribed to the "python-on-a-chip" (or p14p for short)
> Google Group.
> Site: http://groups.google.com/group/python-on-a-chip

Dean Hall

Sep 20, 2011, 10:01:07 PM
to python-o...@googlegroups.com
The PyMite VM is faithful to the Python 2.6 bytecode.
http://docs.python.org/release/2.6/library/dis.html#python-bytecode-instructions

The thing that will make hardware acceleration difficult is that many
bytecode operations have more than one purpose. For example, take the
modulo operator, '%'. Its primary operation is to perform numerical
modulus (remainder, in the case of an integer). But its secondary
operation is used more often. Can you think of what that is? String
formatting. This situation applies to a handful of operators.
BINARY_MULTIPLY also performs sequence repetition. BINARY_ADD also
performs string concatenation. My fear is that all these special
cases will impair an elegant hardware acceleration.
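The overloading described above can be seen directly from Python itself: one expression compiles to one binary-op bytecode, and the behavior is selected purely by the runtime types of the operands.

```python
import dis

def binop_mod(a, b):
    return a % b   # one compiled binary-op bytecode, two behaviors

# The same compiled code performs integer modulo or string formatting,
# depending only on the runtime types of the operands:
assert binop_mod(10, 3) == 1
assert binop_mod("%d items", 4) == "4 items"

# Likewise '*' doubles as sequence repetition and '+' as concatenation:
assert 3 * 4 == 12 and "ab" * 3 == "ababab"
assert 1 + 2 == 3 and "py" + "mite" == "pymite"

# Disassembly shows a single opcode for '%' regardless of operand types,
# which is exactly what makes type-blind hardware dispatch hard:
dis.dis(binop_mod)
```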

The first place I would want hardware assistance would be in resolving
a dictionary lookup. Right now, PyMite has a brain-dead linear search
for the key in the keys list and returns the value at the
corresponding index in the values list. Hardware could do two things
here: (1) Calculate a CRC-8 on immutable objects to use as a hash
key. (2) Assist hash-table lookup when a key is given.

Since namespaces and dicts are so prevalent throughout the VM and most
Pythonistas' application code, making a faster dict implementation
would be a huge first step.
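The CRC-8 hash proposed above can be sketched in software for reference. This is only an illustrative model of what the custom instruction would compute; the polynomial 0x07 is an assumption here, as the hardware could use any 8-bit polynomial.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bit-serial CRC-8. The polynomial (x^8 + x^2 + x + 1 here) is an
    example choice; real hardware could pick any 8-bit polynomial."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):   # 8 shift/XOR steps per input byte
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

# An 8-bit hash maps any immutable key into one of 256 buckets, so a
# 256-entry hash table needs no further masking or modulo:
bucket = crc8(b"some_attribute_name")
assert 0 <= bucket < 256
assert crc8(b"foo") == crc8(b"foo")   # deterministic, usable as a hash key
```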

!!Dean

Gmail - neonmark

Sep 20, 2011, 10:57:29 PM
to python-o...@googlegroups.com
On 9/21/2011 2:01 PM, Dean Hall wrote:
> The PyMite VM is faithful to the Python 2.6 bytecode.
> http://docs.python.org/release/2.6/library/dis.html#python-bytecode-instructions
>
>
> The thing that will make hardware acceleration difficult is that many
> bytecode operations have more than one purpose. For example, take the
> modulo operator, '%'. It's primary operation is to perform numerical
> modulus (remainder, in the case of an integer). But its secondary
> operation is used more often. Can you think of what that is? String
> formatting. This situation applies to a handful of operators.
> BINARY_MULTIPLY also performs sequence repetition. BINARY_ADD also
> performs string concatenation. My fear is that all these special
> cases will impair an elegant hardware acceleration.
This is where a tagged architecture comes in handy. Using the extra bits
from ECC RAM (not using them for error checking) and a modified ALU
(which preserves the tag bits), the data is tagged with its type.
Operands of different types cannot go through the ALU together; doing so
raises an exception. This helps lower the cost of dynamic type checking.

The type in the tag indicates which datapath the operands should follow.
So '%' on strings can dispatch to string formatting, '%' on integers uses
an integer ALU, and similarly for floats.
(I note the Nios ALU does a single-cycle multiply/divide, so they would
use the same ALU but be packed differently.)
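The tag-dispatch idea can be modeled in a few lines of Python. This is a toy software sketch only; the tag values and function name are hypothetical, standing in for what would be hardware datapath selection and a type-check trap.

```python
# Toy model of tag dispatch: each machine word carries a type tag, and
# the "ALU" routes the '%' operation by tag, trapping on a tag mismatch.
INT, FLOAT, STR = "int", "float", "str"   # hypothetical tag values

def alu_mod(a_tag, a_val, b_tag, b_val):
    if a_tag == INT and b_tag == INT:
        return (INT, a_val % b_val)        # integer datapath
    if a_tag == FLOAT and b_tag == FLOAT:
        return (FLOAT, a_val % b_val)      # float datapath
    if a_tag == STR:
        return (STR, a_val % b_val)        # '%' on strings = formatting
    raise TypeError("tag mismatch: hardware type-check trap")

assert alu_mod(INT, 7, INT, 3) == (INT, 1)
assert alu_mod(STR, "%d", INT, 5) == (STR, "5")
```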

The Lisp machines did this very well in hardware and were very efficient;
they died out for different reasons.
Start here: http://en.wikipedia.org/wiki/Tagged_architecture
OK, so the papers on this date from 1985, but that's just because no one
is doing this right now: all GC etc. is being done in software, and tags
do not help you unless you have an OOP language at the hardware level.
The paper below questions tagged architectures for languages other than
LISP, APL, and Icon, and weighs the cost of tags against the cost of
dynamic type checking, which Python also requires.
http://dl.acm.org/citation.cfm?id=327153
- Browse it here:
http://www.deepdyve.com/lp/association-for-computing-machinery/tagged-architecture-how-compelling-are-its-advantages-9YkhdAibKc


>
> The first place I would want hardware assistance would be in resolving
> a dictionary lookup. Right now, PyMite has a brain-dead linear search
> for the key in the keys list and returns the value at the
> corresponding index in the values list. Hardware could do two things
> here: (1) Calculate a CRC-8 on immutable objects to use as a hash
> key. (2) Assist hash-table lookup when a key is given.
>
> Since namespaces and dicts are so prevalent throughout the VM and most
> Pythonistas' application code. Making a faster Dict implementation
> would be a huge first step.

Yes - sounds like a much better first step.

Vern, did you use:
http://www.altera.com/products/devkits/altera/kit-cyc3-embedded.html
or something else? Which FPGA are you using?

Vern Muhr

Sep 20, 2011, 11:46:28 PM
to python-o...@googlegroups.com
Accelerating dict lookup sounds like a great idea. It would probably be an interesting project. I know someone who has done some custom instruction work. I will have to talk to them.  I have briefly looked over the documentation, but I have not actually tried it out yet.

On Tue, Sep 20, 2011 at 7:01 PM, Dean Hall <dwha...@gmail.com> wrote:
The PyMite VM is faithful to the Python 2.6 bytecode.
http://docs.python.org/release/2.6/library/dis.html#python-bytecode-instructions

The thing that will make hardware acceleration difficult is that many bytecode operations have more than one purpose.  For example, take the modulo operator, '%'.  It's primary operation is to perform numerical modulus (remainder, in the case of an integer).  But its secondary operation is used more often.  Can you think of what that is?  String formatting.  This situation applies to a handful of operators.  BINARY_MULTIPLY also performs sequence repetition.  BINARY_ADD also performs string concatenation.  My fear is that all these special cases will impair an elegant hardware acceleration.

The first place I would want hardware assistance would be in resolving a dictionary lookup.  Right now, PyMite has a brain-dead linear search for the key in the keys list and returns the value at the corresponding index in the values list.  Hardware could do two things here: (1) Calculate a CRC-8 on immutable objects to use as a hash key.  (2) Assist hash-table lookup when a key is given.

Vern Muhr

Sep 20, 2011, 11:54:16 PM
to python-o...@googlegroups.com
I'm using a Terasic DE1 board. It is about $150. I just got a DE0-Nano board. It has a faster and bigger Cyclone IV FPGA and 32MB of SDRAM. It only costs $79. Note that ALL the Altera tools are free!

http://www.terasic.com.tw

James Thomas

Sep 21, 2011, 11:29:52 AM
to python-o...@googlegroups.com
For anybody interested, I found a pretty good writeup of how Python implements its dictionary lookups. 

http://www.laurentluce.com/posts/python-dictionary-implementation/

It is pretty clear that their approach is quite wasteful of memory (although that could be tailored somewhat).  The problem I see is that it requires pre-allocated arrays for holding the hash, key pointer, and value pointer (H,pK,pV) that are a power-of-2 length.  I'm guessing, but it looks like on average it will be significantly less than half full.  Arguably the hash does not need to be stored if you are willing to recalculate it every time you resize the table.

Given that, it seems to me that a binary tree would actually be a more compact approach overall, while providing the benefit of not needing to reallocate and copy arrays routinely (since it is a form of linked list) and reasonably quick lookups that don't require hashing.  There is a processing overhead to balancing the tree on insertion, so that is a significant trade-off.  One alternative to balancing the whole tree is to use a treap ( http://en.wikipedia.org/wiki/Treap ), which provides a statistically balanced tree with a cheap method of balancing, at the cost of an integer per node in storage.

Another approach would be to do it similarly to Python but using an array of pointers to the (H,pK,pV) objects.  If you assume that the array is less than half full on average, it saves space and reduces the need for large contiguous blocks of memory.  Resizing the table is much less painful with this approach.  If you also used modulo instead of AND for calculating the index, the array lengths would not need to be a power of two.  The cost would be more frequent resizing, and it might require a new perturb algorithm for finding empty slots on a hash collision.

The element of the Python design that is wasteful in space and processing is the perturb approach to handling collisions.  But we could use an array of pointers, modulo sizing, and linked lists for collisions.  So the array is of pointers to head node objects (H,pK,pV,pO), in which pO points to another node in the case of a modulo'd hash collision.  Again, storing H is a processing vs. memory trade-off.  Keeping it means we can compare the non-modulo'd values of the hashes prior to doing a full key compare, and we would not need to recompute the hashes when we change the modulo size.  Arguably this construct never gets full; it just starts getting lookup-inefficient, and that is what drives you to increase the modulo size.  But even with all the extra pointers, it is probably more space efficient on average than the Python approach.  This approach also lends itself to a small CRC-8 hash, since you can limit the array size to 256 without having to do any additional special-case handling.
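The chained design in the last paragraph can be sketched as follows. This is a rough Python model under stated assumptions: the class and method names are hypothetical, and `hash(key) & 0xFF` stands in for the proposed CRC-8 hash; a real implementation would be C with explicit node allocation.

```python
class _Node:
    """One (H, key, value, next) entry; pO in the text is `next` here."""
    __slots__ = ("h", "key", "value", "next")
    def __init__(self, h, key, value, nxt=None):
        self.h, self.key, self.value, self.next = h, key, value, nxt

class ChainedDict:
    """Sketch of the modulo-sized, chained hash table described above."""
    def __init__(self, size=8):
        self.buckets = [None] * size       # array of head-node pointers

    def put(self, key, value):
        h = hash(key) & 0xFF               # stand-in for the CRC-8 hash
        i = h % len(self.buckets)          # modulo, not power-of-2 AND
        node = self.buckets[i]
        while node is not None:
            # compare stored hashes first; only equal hashes need a
            # (possibly expensive) full key compare
            if node.h == h and node.key == key:
                node.value = value         # overwrite existing key
                return
            node = node.next
        self.buckets[i] = _Node(h, key, value, self.buckets[i])

    def get(self, key):
        h = hash(key) & 0xFF
        node = self.buckets[h % len(self.buckets)]
        while node is not None:
            if node.h == h and node.key == key:
                return node.value
            node = node.next
        raise KeyError(key)

d = ChainedDict()
d.put("x", 1); d.put("y", 2); d.put("x", 3)
assert d.get("x") == 3 and d.get("y") == 2
```

Note the table itself never fills; a colliding insert just lengthens a chain, so growing the bucket array is purely a lookup-speed decision, as the paragraph above argues.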

I had thought about a global key table for all dictionaries, which seemed like it might be efficient storage-wise and likely faster than the current approach.  But keeping it accurate when you delete dictionary objects is going to require extra work.  I'm not sure this approach is a dead end, but it would require quite a bit of engineering to make it work well.

Here is a link for doing CRC in Altera/Nios II:  http://www.altera.com/support/examples/nios2/exm-crc-acceleration.html  They claim up to a 2000x speedup (they may be talking about an I/O stream, not processor calls), but I don't think you could speed up the CRC-8 by anywhere near that much (8 compares and 8 XORs per character you are hashing).  Assuming the compares and XORs take a clock cycle each, I would suppose the speedup would be roughly 20x if the accelerator instruction used a byte-at-a-time accumulator approach.  If the custom instruction used a DMA-like approach and the accelerator could load the memory directly, then it could be faster than that.

The other thing I could see benefiting from HW acceleration would be doing key compares (assuming the processor has to load memory to registers and loop).  Load two memory addresses and a length and let the HW do it straight out of memory for you.

JT

pir...@gmail.com

Sep 21, 2011, 11:40:44 AM
to python-o...@googlegroups.com
> Given that it seems to me that that a binary tree would actually be a more
> compact approach overall while providing the benefit of not needing to
> reallocate and copy arrays routinely (since it is a form of linked list) and
> reasonably quick lookups that don't require hashing.  There is a processing
> overhead to balancing the tree on insertion so that is a significant trade
> off.   One alternative to balancing the whole tree is to use a Treap (
> http://en.wikipedia.org/wiki/Treap ) which provides a statistically balanced
> tree with a cheap method of balancing at the cost of an integer per node in
> storage.
>
What's the problem with using a balanced tree directly? SQLite uses one
for the rowIDs and it has high performance (yes, those are integers),
and I don't think Python dicts will have items added or removed so
frequently...


--
"Si quieres viajar alrededor del mundo y ser invitado a hablar en un
monton de sitios diferentes, simplemente escribe un sistema operativo
Unix."
– Linus Tordvals, creador del sistema operativo Linux

James Thomas

Sep 21, 2011, 2:31:06 PM
to python-o...@googlegroups.com
In Python, dictionaries are used all over the place -- I have now forgotten if PyMite uses them as extensively.  Local and global variables are stored in dictionaries.  Object member functions and variables are stored in dictionaries.  So keys are being added to dictionaries all the time, even if the user program is not using them explicitly.  Doing tree balancing each time you assign to a new variable could get very painful and cause wide variability in the run speed of a given segment of code.  One mitigation could be delayed balancing in situations where you are adding a bunch of keys at once.  That is why I suggested a treap as a reasonable trade-off.

pir...@gmail.com

Sep 22, 2011, 4:59:35 AM
to python-o...@googlegroups.com

I have been reading about treaps on Wikipedia, and a treap seems to me just like a regular binary tree but with a random float priority per node, used to do some balancing and improve performance; the page also talks about the randomized binary tree, which is similar but uses integers. It's not a bad idea, and we could use several options for the priority: the Python object id, or the insertion order of the key-value pair in the dict... the latter has the disadvantage that the numbers are always increasing, so the priority will also increase with every insertion and the tree will be suboptimal. One solution is to reuse the ids of deleted objects; that gives some randomization and would also make it possible to reuse their previous space in some cases. In any case, the modification it talks about, which generates a new priority on every access that modifies the tree so the most-accessed nodes get higher priority, is really interesting to me... :-)

In any case, I also found on the Wikipedia page a link to an implementation of the treap: http://code.google.com/p/treapdb/

Gmail - neonmark

Sep 22, 2011, 8:07:54 AM
to python-o...@googlegroups.com
On 9/21/2011 3:54 PM, Vern Muhr wrote:
> I'm using a Terrasic DE1 board. It is about $150. I just got a
> DE0-Nano board. It has a faster and bigger Cyclone IV FPGA, and 32MB
> of SDRAM. It only costs $79. Note that ALL the Altera tools are free!
>
> http://www.terasic.com.tw
Wow - I'm glad you linked that... The Nano is very interesting.
http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=139&No=593&PartNo=4
32MB SDRAM, 74KB chip RAM, 22K logic elements + 66 multipliers on a tiny
board with USB programming.
But only 2KB of flash on board... Maybe we need to piggyback more, or
I'm reading it wrong.
It has the same 50MHz clock as the DE1 but uses a Cyclone IV instead of
a Cyclone II.
How did I miss this one? Looks like they had a launch in March.

Vern - I would be very keen to hear how you get on with this board and
python.

James Thomas

Sep 22, 2011, 9:06:18 AM
to python-o...@googlegroups.com
The thing to remember is that the current approach is O(N), so even a suboptimal tree is a pretty big speed improvement. 

However, I actually think that the CRC8 hash with the linked list for hash collisions is the better approach.  Looking at the VM code, it appears that the dict uses a segmented list to store the keys, so the size penalty of this approach is not quite as severe as I thought, since there is some list overhead and unused space in the current implementation.  A benefit of the hash approach is that you can use a regular C array for the key table, because it is only resized when it needs to be completely rewritten anyway (moving up or down in modulo/mask size -- all the keys change position).  That gets both a speed and space improvement for the key table portion of the code.

One further optimization for the hash-based solutions is to pre-compute the CRC8 hashes in pmImgCreator.py for all string objects and store it as an attribute of the string object.  Since we are talking about an 8-bit hash the extra storage required is not terribly painful.  I suspect that having the hash stored in the object (and using them for dict and cache lookups) probably would save a lot more clock cycles than storing the string length in the object (at a cost of two bytes) currently does.  Obviously there are situations where that does not work (dynamically generated keys, integer or tuple keys) but it would handle the common use case.  Any strings created during runtime will have their hash calculated in order to accelerate the string cache lookup and it can be saved into the object at that time.  Finally it seems to me that the string cache is a special case of dictionary with no value pointer, so it would probably be wise to leverage the dictionary code to implement a hash based lookup version of it.

Overall I'm fairly convinced that adopting a hash based system for dictionaries and using them to replace all the linear searches in PyMite would result in a huge performance increase overall -- especially for larger programs.  They basically let you go from O(N) to O(1). 

JT

cb750

Jun 29, 2012, 3:59:42 PM
to python-o...@googlegroups.com

+1: I am also interested in any attempt at an Altera Nios port.  If you have source code, I'd be happy to tinker with it.

cb750