Sw/Hw inter-op ?

Björn Berglöf

May 11, 2015, 5:23:49 PM
to zyli...@googlegroups.com

Hi,

I work on 100 Gbps telecom stuff and have an FPGA/ASIC background, and the ZPU is a refreshing acquaintance. There are some mechanisms I do not understand that maybe you guys can explain.

1/ Most of the ZPU cores can be modified with generics. But how does the compiler know if e.g. a multiplier is present? I’ve looked through the zillion switches of the zpu-elf-gcc without enlightenment. Can it be some linking step I do not understand? Can it be related to the “Code points 33 to 63 may be emulated by code in vectors 2 through 32” mechanism?

2/ When running e.g. the Dhrystone test the time is measured. But how does the compiler know what clock frequency the core is running at? As far as I can understand the “-phi” switch informs the compiler about an address map where to find a timer. I suspect it is related to the crt_io.c file? If so, when is this used?

3/ The phi memory map specifies timers and interrupt registers, which fit well with the straightforward code of the IO module. But are the “phi” registers enough for running a simple RTOS? I do not really need perfect timing, but would like to run multi-threaded to simplify my (multi-channel) programs.

4/ I’m looking for a cache solution in front of the RAM to push the top speed towards 1 GHz :-) Any pointers to work being done in this area? Also, any work on stalling / halting the processor? Maybe someone can clarify the mem_busy input that seems to have somewhat that intent…

Any comments are greatly appreciated!

     / Björn “MrBear” Berglöf

 

PS. I have synthesized the Zealot core into 28 nm. It’s really small! 0.006 mm2 excluding the RAM. About 500 MHz, but it’s my memory with ECC logic that limits the speed.
 

Øyvind Harboe

May 11, 2015, 5:37:58 PM
to zyli...@googlegroups.com
So make that 100000 ZPU's where you can fit a single Intel CPU? :-)

600 mm2 / 0.006 mm2 = 100000

http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors



--
Øyvind Harboe - Can Zylin Consulting help on your project?
http://www.zylin.com/

Rick Collins

May 11, 2015, 7:21:23 PM
to zyli...@googlegroups.com
At 05:23 PM 5/11/2015, you wrote:
> 2/ When running e.g. the Dhrystone test the time is measured. But how does the
> compiler know what clock frequency the core is running at? As far as I can
> understand the “-phi” switch informs the compiler about an address map where
> to find a timer. I suspect it is related to the crt_io.c file? If so, when is this used?


Why does the compiler care what clock speed the CPU is running at?  It would be up to you to do the math of converting your timer values to actual time values.
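
A minimal sketch of that conversion, done on the host side in Python. It assumes a free-running up-counter that increments once per clock cycle and wraps at 32 bits; the function name, the counter width and the 100 MHz figure are all illustrative assumptions, not taken from the ZPU sources.

```python
def ticks_to_seconds(start_ticks, end_ticks, clock_hz, counter_bits=32):
    """Convert two raw timer readings into elapsed seconds.

    Assumption: the timer is a free-running up-counter clocked at
    clock_hz that wraps at 2**counter_bits.
    """
    modulus = 1 << counter_bits
    # The modulo makes the subtraction correct across a single wrap.
    elapsed_ticks = (end_ticks - start_ticks) % modulus
    return elapsed_ticks / clock_hz


if __name__ == "__main__":
    # 50,000,000 ticks on an assumed 100 MHz core.
    print(ticks_to_seconds(1_000, 50_001_000, clock_hz=100_000_000))
```

The clock frequency only enters in the final division, so it is the one number the user has to supply; nothing in this calculation requires the compiler to know it.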


> 4/ I’m looking for a cache solution in front of the RAM to push the top speed
> towards 1 GHz :-) Any pointers to work being done in this area? Also, any work
> on stalling / halting the processor? Maybe someone can clarify the
> mem_busy input that seems to have somewhat that intent…
>
> Any comments are greatly appreciated!
>      / Björn “MrBear” Berglöf
>
> PS. I have synthesized the Zealot core into 28 nm. It’s really
> small! 0.006 mm2 excluding the RAM. About 500 MHz, but it’s
> my memory with ECC logic that limits the speed.

If you are implementing this design in an ASIC you may just achieve 1 GHz if you pipeline it more.  Is there a reason why you don't implement the RAM on the same chip which could speed it up tremendously?  I am not familiar with the Zealot core, but the original ZPU cores were designed for small size and flexibility.  They used a lot of clock cycles to accomplish what many designs do in fewer cycles.  So don't equate a high clock speed with a fast CPU.  I expect the machine architecture can be optimized to reduce the number of clock cycles used at the expense of clock rate and possibly more logic.  But if the clock rate is limited by the memory speed rather than the logic delays you might do better with a more complex architecture.  At 28 nm you certainly can afford to use a few more transistors. 

Rick

bjorn....@mrbear.se

May 12, 2015, 9:31:12 AM
to zyli...@googlegroups.com, bjorn....@mrbear.se

Øyvind wrote:
> So make that 100000 ZPU's where you can fit a single Intel CPU? :-)
> 600 mm2 / 0.006 mm2 = 100000

Øyvind,
unfortunately the RAM is not small: 64 kB takes 0.19 mm2, so a 100 000
core chip would have a 144 mm side :-)   On the serious side, there are
architectures for putting hundreds of processors on the same chip. This processor
starts 300 000 000 new programs (yes programs, not OPs) each second (and
yes, I'm one of the designers).
http://www.marvell.com/network-processors/technology/data-flow-architecture/
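
The area arithmetic in this exchange can be reproduced with a short sketch. It uses only the figures quoted in the thread (0.006 mm2 per core, 0.19 mm2 per 64 kB RAM, a 600 mm2 die) and assumes a square die with no routing, pad or spacing overhead; the function names are illustrative.

```python
import math

CORE_MM2 = 0.006  # Zealot core at 28 nm, from the original post
RAM_MM2 = 0.19    # 64 kB RAM, from the reply above


def cores_that_fit(die_mm2, per_core_mm2):
    """Number of cores of a given area that fit on a die, ignoring overhead."""
    return round(die_mm2 / per_core_mm2)


def die_side_mm(cores, core_mm2=CORE_MM2, ram_mm2=RAM_MM2):
    """Side of a square die holding `cores` core+RAM pairs, ignoring overhead."""
    return math.sqrt(cores * (core_mm2 + ram_mm2))


if __name__ == "__main__":
    print(cores_that_fit(600, CORE_MM2))       # cores only, no RAM
    print(f"{die_side_mm(100_000):.0f} mm")    # cores plus 64 kB RAM each
```

With these inputs the side comes out at roughly 140 mm, in the same ballpark as the 144 mm quoted; the remaining difference presumably reflects overhead that this sketch does not model.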

----------------------------------------------------------------------------------


Rick Collins wrote:
> Why does the compiler care what clock speed the
> CPU is running at? It would be up to you to do
> the math of converting your timer values to actual time values.

Rick,
I agree. When I write my own code, I simply read the timer and do the trivial
math. But I noticed that several users seem to run the Dhrystone test (which
includes various time.h stuff) without any reference to compensating for clock speed.
So how it is supposed to work is still a mystery to me.

----------------------------------------------------------------------------------


Rick Collins wrote:
> If you are implementing this design in an ASIC you may just achieve 1 GHz if you pipeline it
> more. Is there a reason why you don't implement the RAM on the same chip which could speed it up
> tremendously? I am not familiar with the Zealot core, but the original ZPU cores were designed
> for small size and flexibility. They used a lot of clock cycles to accomplish what many designs
> do in fewer cycles. So don't equate a high clock speed with a fast CPU. I expect the machine
> architecture can be optimized to reduce the number of clock cycles used at the expense of
> clock rate and possibly more logic.

Rick,
I am implementing RAM within the same chip, and I am not really trying to push the frequency
for performance reasons. It is just very convenient if the uController runs at the same
speed as the rest, to avoid any asynchronous stuff.

I do understand pipelining of the processor, but I believe it easily runs at 1 GHz. My problem
is that the RAM access time, coupled with the ECC logic (necessary at 28 nm), needs to be pipelined.
Keeping the top of the stack plus a simple instruction cache in a write-through cache is one way
to handle this. But I need to halt the ZPU whenever I get a cache miss. Hence the mem_busy
question.
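
To make the stall requirement concrete, here is a toy behavioural model of a direct-mapped, write-through cache that reports how many cycles the core would have to be held on each access. It is only a sketch: the class name, the line count, the miss penalty and the choice to allocate a line on writes are assumptions for illustration, and it does not describe how mem_busy behaves in any actual ZPU core.

```python
class DirectMappedCache:
    """Toy direct-mapped, write-through cache model (word addressed)."""

    def __init__(self, lines=16, miss_penalty=3):
        self.lines = lines
        self.miss_penalty = miss_penalty  # cycles the core is held on a miss
        self.tags = [None] * lines
        self.stall_cycles = 0

    def read(self, word_addr):
        """Return how many cycles a stall input would be asserted."""
        index = word_addr % self.lines
        tag = word_addr // self.lines
        if self.tags[index] == tag:
            return 0                      # hit: no stall
        self.tags[index] = tag            # miss: fetch from RAM, fill line
        self.stall_cycles += self.miss_penalty
        return self.miss_penalty

    def write(self, word_addr):
        """Write-through with allocation: RAM and the cache line are updated.

        Assumes a write buffer hides the RAM latency, so no stall.
        """
        self.tags[word_addr % self.lines] = word_addr // self.lines
        return 0


if __name__ == "__main__":
    cache = DirectMappedCache(lines=4, miss_penalty=3)
    for addr in [0, 1, 0, 1, 4, 0]:
        cache.read(addr)
    print(cache.stall_cycles)
```

In this model addresses 0 and 4 map to the same line, so alternating between them misses every time, which is the case where the core spends most of its time halted.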

----------------------------------------------------------------------------------


Rick Collins wrote:
> But if the clock rate is limited by the memory speed rather than the logic delays you might
> do better with a more complex architecture. At 28 nm you certainly can afford to use a
> few more transistors.

Rick,
I completely agree! I am really not looking for small or high performance. And I'm not
really in love with ZPU either. But it fulfills 3 Must and 1 NiceToHave:
 - Must be a real LGPL, no code-compatible clone, for legal reasons.
 - Must have open (or commercial) tool chain for C.
 - Must have or be able to add GDB or similar debug.
 - 32b data path to speed up data copying (a substantial part of the job).

This said, I am VERY open to suggestions of other cores that fulfill the above.

----------------------------------------------------------------------------------
Thanks for the comments! And again, any input on my original
questions (below) is greatly appreciated.
     / Björn “MrBear” Berglöf

----------------------------------------------------------------------------------

1/ Most of the ZPU cores can be modified with generics. But how does the compiler know if e.g. a multiplier is present? I’ve looked through the zillion switches of the zpu-elf-gcc without enlightenment. Can it be some linking step I do not understand?

2/ When running e.g. the Dhrystone test the time is measured. But how does the compiler know what clock frequency the core is running at? As far as I can understand the “-phi” switch informs the compiler about an address map where to find a timer. I suspect it is related to the crt_io.c file? If so, when is this used?

3/ The phi memory map specifies timers and interrupt registers, which fit well with the straightforward code of the IO module. But are the “phi” registers enough for running a simple RTOS? I do not really need perfect timing, but would like to run multi-threaded to simplify my (multi-channel) programs.


Rick Collins

May 12, 2015, 12:55:32 PM
to zyli...@googlegroups.com
At 09:31 AM 5/12/2015, you wrote:
> Rick Collins wrote:
> > But if the clock rate is limited by the memory speed rather than the logic delays you might
> > do better with a more complex architecture. At 28 nm you certainly can afford to use a
> > few more transistors.
>
> Rick,
> I completely agree! I am really not looking for small or high performance. And I'm not
> really in love with ZPU either. But it fulfills 3 Must and 1 NiceToHave:
>  - Must be a real LGPL, no code-compatible clone, for legal reasons.
>  - Must have open (or commercial) tool chain for C.
>  - Must have or be able to add GDB or similar debug.
>  - 32b data path to speed up data copying (a substantial part of the job).
>
> This said, I am VERY open to suggestions of other cores that fulfill the above.

Ok, I understand better now.  To double the clock speed you may need to perform extensive redesign of the architecture.  Like I said, I don't recall having looked at the Zealot core, but I did look at the original ZPU.  It was designed for minimum size and made heavy reuse of the various components which adds multiplexors slowing the clock.  If you consider the micro-ops that need to be implemented and lay out the logic paths for optimum speed it is likely you can achieve a much more streamlined architecture. 

I seem to recall that the ZPU is the only soft core that meets your second requirement as well as the first.  There are tons of original open cores but last time I looked no one had a C compiler for them.  There are also a few cores that duplicate existing commercial ISAs, but you indicate you don't want to deal with the potential issues. But... as I look it up  I see there are several, larger cores that *are* GPL'd with C compilers. 

http://en.wikipedia.org/wiki/LEON

http://en.wikipedia.org/wiki/OpenSPARC

http://en.wikipedia.org/wiki/OpenRISC

http://en.wikipedia.org/wiki/LatticeMico32

http://en.wikipedia.org/wiki/AEMB

http://en.wikipedia.org/wiki/RISC-V

There is more than one "clean" implementation of the MicroBlaze and other CPUs which should be OK from a license viewpoint (including the PacoBlaze).  The RISC-V is all about open source but may be a bigger bite to chew; I think it may be available only as 64 bit.  The LatticeMico32 is fully open source.  I don't know how large it is or how fast you might run it.

The main difference with all of these and the ZPU is how the instruction set of the ZPU was designed for minimal implementation.  All the other CPUs will be larger if not much larger.

Rick

Markus Wehner

May 13, 2015, 3:36:00 AM
to zyli...@googlegroups.com
 Hi,

I recently did a rough comparison of ZPU (Zealot small/medium) and OpenRISC (mor1kx). Both should fulfill the mentioned requirements (mor1kx is licensed under the OHDL, which is an LGPL adapted to hardware designs). Here are some results based on a Xilinx 7-Series FPGA architecture (6-input LUTs):

Zealot small: ~290 Slices (900 LUTs)
Zealot medium: ~390 Slices (1400 LUTs)
mor1kx, instr cache enabled, data cache disabled: ~1600 Slices

Both synthesize at > 100 MHz. For a performance comparison, instead of using some abstract benchmark, I just ran a memcpy of 1 KByte on external BlockRAM as a rough estimate, since data transfer is one of my main targets.

Zealot small: 0.4ms
Zealot medium: 0.1ms
mor1kx, classic WB mode: 0.025ms (instr cache en), 0.06ms (instr cache off)
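
For easier comparison, the copy times above can be turned into throughput and relative speedup. This is a sketch that assumes 1 KByte means 1024 bytes and simply takes the reported times at face value; the dictionary keys and function names are illustrative.

```python
COPY_BYTES = 1024  # assumption: "1KByte" means 1024 bytes

# memcpy times in milliseconds, as reported in this post
COPY_TIME_MS = {
    "Zealot small": 0.4,
    "Zealot medium": 0.1,
    "mor1kx (instr cache on)": 0.025,
    "mor1kx (instr cache off)": 0.06,
}


def throughput_mb_per_s(num_bytes, time_ms):
    """Throughput in MB/s (1 MB = 10**6 bytes) for a copy taking time_ms."""
    return num_bytes / (time_ms / 1000.0) / 1e6


def speedup(baseline_ms, other_ms):
    """How many times faster `other` is than `baseline`."""
    return baseline_ms / other_ms


if __name__ == "__main__":
    base = COPY_TIME_MS["Zealot small"]
    for name, ms in COPY_TIME_MS.items():
        mbps = throughput_mb_per_s(COPY_BYTES, ms)
        print(f"{name}: {mbps:.2f} MB/s, {speedup(base, ms):.1f}x Zealot small")
```

On these numbers Zealot small moves about 2.6 MB/s and mor1kx with the instruction cache enabled about 41 MB/s, a factor of roughly 16.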

Some other remarks: Both support gcc/newlib/gdb, but OpenRISC has a more recent version. The OpenRISC code size seems to be about 2-3 times larger than for ZPU. OpenRISC has multicore toolchain support.

Best regards,
Markus

Hieronymus vanWontz

May 15, 2015, 6:37:36 AM
to zyli...@googlegroups.com
Hi,


> I completely agree! I am really not looking for small or high performance. And I'm not
> really in love with ZPU either. But it fulfills 3 Must and 1 NiceToHave:
>  - Must be a real LGPL, no code-compatible clone, for legal reasons.
>  - Must have open (or commercial) tool chain for C.
>  - Must have or be able to add GDB or similar debug.
>  - 32b data path to speed up data copying (a substantial part of the job).
>
> This said, I am VERY open to suggestions of other cores that fulfill the above.

There are very few alternatives in the field, I believe. The Xtensa architecture is rather compact and 'hot', but I'm not aware of free clones. On the somewhat larger scale I have so far been happiest with the MIPS architecture, but the ZPU is still unbeaten when it comes to low resource usage.
If you can spare more logic and HDL resources, a DMA engine does the data copying job perfectly on the smallest ZPU.
Adding a debugger was quite a piece of cake on the Zealot zpu small; there's some in-circuit emulation code in the git repo. I've done quite some work on the JTAG debugger side and am happily downloading to and debugging the ZPU on a daily basis. Due to its simplicity it's just darn robust and withstood quite some regression testing where other CPU implementations exhibited bugs.

The only drawback is the GCC support, which is still at v3.4.x, I believe. So for new CPU derivatives, you'll have to hack around on a bit of buggy old GNU code.
Not sure if someone is porting the ZPU support to the latest GCC.

Cheers,

- Martin

 

Björn Berglöf

May 15, 2015, 6:03:23 PM
to zyli...@googlegroups.com

Thanks guys for the last couple of days of feedback, it's been very helpful!

My current plan is to go with the MinSoc (based on the or1200)
http://opencores.org/project,minsoc and adapt it to my ASIC.
Pretty much a perfect fit. :-)

   / MrBear

 

 
