[riscv-hw] What do you want from JTAG debug?


Tim Newsome

Nov 19, 2015, 1:02:14 PM
to hw-...@lists.riscv.org
Good morning!

I'm Tim, and I have recently been hired by SiFive to help spec and implement a JTAG debug interface for RISC-V. SiFive is a startup founded by Krste, Yunsup, and Andrew whose mission is to democratize the design and fabrication of
application-specific SoCs using RISC-V processors. This debug spec will be just as open as the RISC-V ISA, and hopefully will be as good as the ISA so everybody can just use it. SiFive will release open source implementations of the debug spec for both the Rocket Chip and Z-Scale cores.

I've spent 8 years at GHS implementing software support for a variety of hardware debug interfaces, so I have a pretty good idea what features people might want, but I'd love to hear what you think before seriously starting to write a spec. Simple features (hardware data breakpoints), workflows (program flash and reboot), other requirements (must work well with 1024 cores), I want to hear it all. I can't promise to fit everything in, but I will do my best.

The current goal is to have a working system done by July 1, 2016. That means a RISC-V implementation on an FPGA with a real JTAG debugger connected to it, debugging RISC-V code. I'm getting started on this ASAP, so please let me know what you want the debug interface to do for you.

Thank you,
Tim

Tommy Thorn

Nov 19, 2015, 1:52:24 PM
to Tim Newsome, hw-...@lists.riscv.org

> On Nov 19, 2015, at 10:02 , Tim Newsome <t...@sifive.com> wrote:
>
> Good morning!
>
> I'm Tim, and I have recently been hired by SiFive to help spec and implement a JTAG debug interface for RISC-V. SiFive is a startup founded by Krste, Yunsup, and Andrew whose mission is to democratize the design and fabrication of
> application-specific SoCs using RISC-V processors.

Wow, that's big news. Congrats! I look forward to hearing more.


> This debug spec will be just as open as the RISC-V ISA, and hopefully will be as good as the ISA so everybody can just use it. SiFive will release open source implementations of the debug spec for both the Rocket Chip and Z-Scale cores.
>
> I've spent 8 years at GHS implementing software support for a variety of hardware debug interfaces, so I have a pretty good idea what features people might want, but I'd love to hear what you think before seriously starting to write a spec. Simple features (hardware data breakpoints), workflows (program flash and reboot), other requirements (must work well with 1024 cores), I want to hear it all. I can't promise to fit everything in, but I will do my best.

Having worked at semiconductor companies, I've been quite frustrated with the
afterthought debugging features typically offered. They tend to be driven by the
hardware implementation and don't fit what a SW developer needs. One typical
example: pick a small subset of the following random and obscure set of events.
Take an interrupt if any of them happens.

Further, in the face of multiple threads, I've found "single step" debugging impractical
and trace-based debugging the only way to make progress.

For my embedded cores, I implemented(*) two features I found very helpful:

- A trace buffer containing the pc of every taken control flow operation.
This allows you to better tell "how did we get here" than you can get from the stack.

- For a subset of memory locations, trace of the cycle-time, hartid, and PC of stores to
this location.

All traces should spill to external memory for large and useful traces.
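[Editorial illustration: the two trace features Tommy describes could be modeled roughly as below. This is a hedged Python sketch; the record fields, names, and the ring-buffer spill policy are assumptions for illustration, not anything Tommy specified.]

```python
# Hypothetical model of the two trace record types described above, plus a
# bounded buffer standing in for the external-memory spill region.
# All names and fields are illustrative assumptions, not a spec.
from collections import deque, namedtuple

# PC of every taken control-flow operation.
BranchRecord = namedtuple("BranchRecord", "pc")
# Stores to a watched location: cycle timestamp, hart, PC, and address.
StoreRecord = namedtuple("StoreRecord", "cycle hartid pc addr")

class TraceBuffer:
    """Ring buffer: the oldest events drop off once capacity is reached."""
    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)

    def record(self, entry):
        self.entries.append(entry)

buf = TraceBuffer(capacity=4)
for pc in (0x100, 0x120, 0x140, 0x160, 0x180):
    buf.record(BranchRecord(pc))
buf.record(StoreRecord(cycle=1234, hartid=0, pc=0x184, addr=0x9000_0000))
# Capacity 4, six events recorded: only the newest four survive.
assert list(buf.entries)[0] == BranchRecord(0x140)
```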

Regards,
Tommy
(*) my implementation wasn't quite this general, alas.

Tim Newsome

Nov 19, 2015, 2:42:28 PM
to Tommy Thorn, hw-...@lists.riscv.org
On Thu, Nov 19, 2015 at 10:52 AM, Tommy Thorn <tommy...@thorn.ws> wrote:
> I've spent 8 years at GHS implementing software support for a variety of hardware debug interfaces, so I have a pretty good idea what features people might want, but I'd love to hear what you think before seriously starting to write a spec. Simple features (hardware data breakpoints), workflows (program flash and reboot), other requirements (must work well with 1024 cores), I want to hear it all. I can't promise to fit everything in, but I will do my best.

Having worked at semiconductor companies, I've been quite frustrated with the
afterthought debugging features typically offered.  They tend to be driven by the
hardware implementation and don't fit what a SW developer needs.  One typical
example: pick a small subset of the following random and obscure set of events.
Take an interrupt if any of them happens.

I don't quite follow your example of what doesn't fit. Are you talking about the hardware breakpoints these implementations provide?

Further, in the face of multiple threads, I’ve found “single step” debugging impractical
and trace-based debugging the only way to make progress.

For my embedded cores, I implemented(*) two features I found very helpful:

- A trace buffer containing the pc of every taken control flow operation.
  This allows you to better tell “how did we get here” than you can get from the stack.

How do you look at this data? I ask because in theory you can store more information in fewer bits, but then a lot more tooling is required in the debugger to show it in a way that makes sense. If you're in the habit of just hex dumping a buffer, then that might not be what you want. But if you're always in an environment where it's easy to add some tooling, then a fancier trace format would be preferable.

- For a subset of memory locations, trace of the cycle-time, hartid, and PC of stores to
  this location.

By cycle time you mean a timestamp (in cycles) when the instruction was executed, or the time it actually took for the store to happen? (The latter actually seems pretty tricky to define, with caches in the mix.)
How many memory ranges would you typically be interested in?
How useful would it be to store the value of each store as well?

All traces should spill to external memory for large and useful traces.

How large is large?
Are you interested at all in a dedicated trace port, so a program can be traced without affecting its timing at all?

Sorry for all the questions. I'm just trying to put together exactly what you want. I think I know what you want, but it's better to ask and be sure.

Thanks for the feedback,
Tim

Jamey Hicks

Nov 19, 2015, 2:59:47 PM
to Tim Newsome, Tommy Thorn, hw-...@lists.riscv.org
I don't think I've ever used a JTAG-based debugger, but execution traces are extremely useful.

At Nokia, what we mostly used was timestamped process ID and procedure entry/exit so that we could measure system performance.  For example, looking at full stack performance including kernel, database, and application when flick-scrolling on the touchscreen so we could see where the time was going.

Here at CSAIL we are working on a RISC-V implementation in BSV and we're using online tracing of execution to verify correct execution of our implementation.

As you point out, some tooling is required. The current state of the art for tracing on production hardware is extremely difficult to use.  Surely with an open ISA and open source software we can do better than that. I am completely agnostic about what the hardware interface is for configuration and control of debug, as long as it is a well documented interface for which open source tools are available.

Jamey

Arun Thomas

Nov 19, 2015, 3:20:35 PM
to Tim Newsome, hw-...@lists.riscv.org
On Thu, Nov 19, 2015 at 1:03 PM Tim Newsome <t...@sifive.com> wrote:
<snip>
The current goal is to have a working system done by July 1, 2016. That means a RISC-V implementation on an FPGA with a real JTAG debugger connected to it, debugging RISC-V code. I'm getting started on this ASAP, so please let me know what you want the debug interface to do for you.

Along the lines of what Tommy and Jamey were suggesting, I'd like to see something like Intel Processor Trace (PT) on RISC-V:


ARM has similar features in CoreSight, and they're attempting to align with Intel PT as much as possible:


Best,
Arun 

Stefan Wallentowitz

Nov 19, 2015, 3:20:47 PM
to hw-...@lists.riscv.org
On 19.11.2015 20:59, Jamey Hicks wrote:
> I don't think I've ever used a JTAG-based debugger, but execution traces
> are extremely useful.

It may be interesting for you to follow, or to help us by reviewing, the
Open SoC Debug project:

* Website: opensocdebug.org

* Introductory slides: http://opensocdebug.org/slides/2015-11-12-overview/

We are transferring some material from a previous project, but are mostly
trying to do a clean-room specification.

One of the first things we are adding, thanks to Andreas Traber and the Pulp
team, is support for run-control debug for cores based on the advanced
debug system (OpenRISC-ish, but pretty generic).

Please contact me if you're interested in more detail.

Best,
Stefan

Tim Newsome

Nov 19, 2015, 3:32:52 PM
to Jamey Hicks, Tommy Thorn, hw-...@lists.riscv.org
On Thu, Nov 19, 2015 at 11:59 AM, Jamey Hicks <jamey...@gmail.com> wrote:
I don't think I've ever used a JTAG-based debugger, but execution traces are extremely useful.

At Nokia, what we mostly used was timestamped process ID and procedure entry/exit so that we could measure system performance.  For example, looking at full stack performance including kernel, database, and application when flick-scrolling on the touchscreen so we could see where the time was going.

Here at CSAIL we are working on a RISC-V implementation in BSV and we're using online tracing of execution to verify correct execution of our implementation.

What exactly does "online" mean in this context?

As you point out, some tooling is required. The current state of the art for tracing on production hardware is extremely difficult to use.  Surely with an open ISA and open source software we can do better than that. I am completely agnostic about what the hardware interface is for configuration and control of debug, as long as it is a well documented interface for which open source tools are available.

Is there anything you're not agnostic about? (Eg. type of data collected during trace, size of trace buffer, ...)

Tim

Ray VanDeWalker

Nov 19, 2015, 4:03:18 PM
to hw-...@lists.riscv.org

I think you ought to do an IEEE-standard debugger.

The IEEE-ISTO 5001-2003 (Nexus) feature set, class 1, is the basic JTAG interface: cheap, practical, and it meets many needs.

Then as time and budget permits, add hardware and software options supporting classes 2 through 4.

The Nexus options are standardized, and were developed by demanding customers, e.g. people debugging nonstop engine controllers.

Most of the features people have asked for so far fit into these options.

And, then, you’re done.  You have a range of good debuggers by any standard.

Tim Newsome

Nov 19, 2015, 4:07:56 PM
to Arun Thomas, hw-...@lists.riscv.org
On Thu, Nov 19, 2015 at 12:20 PM, Arun Thomas <arun....@gmail.com> wrote:
On Thu, Nov 19, 2015 at 1:03 PM Tim Newsome <t...@sifive.com> wrote:
<snip>
The current goal is to have a working system done by July 1, 2016. That means a RISC-V implementation on an FPGA with a real JTAG debugger connected to it, debugging RISC-V code. I'm getting started on this ASAP, so please let me know what you want the debug interface to do for you.

Along the lines of what Tommy and Jamey were suggesting, I'd like to see something like Intel Processor Trace (PT) on RISC-V:

That's very cool, but also a huge spec. While it seems like a good goal, it's going to take more than 6 months to put that together.
Are there any features in particular that you want?
For instance, would you be happy with tracing everything to a contiguous buffer (instead of adding everything required to let an individual user process use tracing without revealing data from other processes)?

Tim

Tommy Thorn

Nov 19, 2015, 5:24:31 PM
to Tim Newsome, hw-...@lists.riscv.org
Having worked at semiconductor companies, I've been quite frustrated with the
afterthought debugging features typically offered.  They tend to be driven by the
hardware implementation and don't fit what a SW developer needs.  One typical
example: pick a small subset of the following random and obscure set of events.
Take an interrupt if any of them happens.

I don't quite follow your example of what doesn't fit. Are you talking about the hardware breakpoints these implementations provide?

Sorry, I was making a general statement about debugging facilities.
In the example above, events might be some functional unit transitioning
between two state.  With my best guess at what you consider hardware breakpoints
(eg. exists i: pc == K[i]  =>  exception), this would be different.

- A trace buffer containing the pc of every taken control flow operation.
  This allows you to better tell “how did we get here” than you can get from the stack.

How do you look at this data? I ask because in theory you can store more information in fewer bits, but then a lot more tooling is required in the debugger to show it in a way that makes sense.

I assume controls to flush the internal buffer and CSRs pointing to the start and end of the trace buffer.
Access to the buffer would be the same way you access memory.  You do provide a way to read and
write memory over JTAG, right?


If you're in the habit of just hex dumping a buffer, then that might not be what you want. But if you're always in an environment where it's easy to add some tooling, then a fancier trace format would be preferable.

I’m not sure what you mean, but it’s hardly difficult to add tool support for my suggestion.


- For a subset of memory locations, trace of the cycle-time, hartid, and PC of stores to
  this location.

By cycle time you mean a timestamp (in cycles) when the instruction was executed, or the time it actually took for the store to happen? (The latter actually seems pretty tricky to define, with caches in the mix.)

I actually had the former in mind, but the latter would be amazing for
debugging memory races.  Alas, probably too expensive in practice.


How many memory ranges would you typically be interested in?

That's an important question.  Having a small number (say, 4) of memory
locations is IMO useless in practice.  If the implementation has a TLB,
then I'd consider enabling tracing on a per-page level, with the bit in the
PTE.


How useful would it be to store the value of each store as well?

data + byteenables, yes I forgot(!) that.

How large is large?

128K+ events, but obviously “large” is context dependent.

Are you interested at all in a dedicated trace port, so a program can be traced without affecting its timing at all?

IMO, anything that requires specialized hardware will not get as much usage so *I* wouldn’t,
but you are of course correct that completely avoiding observation effects would require
dedicated resources, such as a dedicated trace buffer [per core].

Sorry for all the questions. I'm just trying to put together exactly what you want. I think I know what you want, but it's better to ask and be sure.

Thanks for asking.  I appreciate that these are just my opinions, but
as with everything RISC-V, this is an opportunity to do things better.

Regards,
Tommy

Tim Newsome

Nov 19, 2015, 8:29:02 PM
to Tommy Thorn, hw-...@lists.riscv.org
On Thu, Nov 19, 2015 at 2:24 PM, Tommy Thorn <tommy...@thorn.ws> wrote:

- A trace buffer containing the pc of every taken control flow operation.
  This allows you to better tell “how did we get here” than you can get from the stack.

How do you look at this data? I ask because in theory you can store more information in fewer bits, but then a lot more tooling is required in the debugger to show it in a way that makes sense.

I assume controls to flush the internal buffer and CSRs pointing to the start and end of the trace buffer.
Access to the buffer would be the same way you access memory.  You do provide a way to read and
write memory over JTAG, right?

Yes, you can read/write memory over JTAG.
I meant more in a user experience kind of way.
If you're in the habit of just hex dumping a buffer, then that might not be what you want. But if you're always in an environment where it's easy to add some tooling, then a fancier trace format would be preferable.

I’m not sure what you mean, but it’s hardly difficult to add tool support for my suggestion.

Depends on how fancy we get.
You can save a lot of space by only storing a single bit for every branch taken/not taken, instead of storing full PCs. That means you need access to the code that was running, though. And in a modern system where libraries are dynamically loaded and unloaded, or where there's self-modifying code, that can get tricky.
(My inclination would be to use some compression, and not support things like self-modifying code.)

How many memory ranges would you typically be interested in?

That's an important question.  Having a small number (say, 4) of memory
locations is IMO useless in practice.  If the implementation has a TLB,
then I'd consider enabling tracing on a per-page level, with the bit in the
PTE.

I believe as a general design principle we don't like to put things in the TLB that don't "belong" there. How many pages might you want to set this bit on?
What about memory ranges, instead of locations?
I.e. the breakpoint hits if the address accessed is between A and B. Four sets of those seem pretty typical, although I'm not sure if that's enough to be useful. (The spec will probably allow for a largish number of them, if the implementer so chooses.)
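[Editorial illustration: range matching as described here can be sketched in a few lines of Python. The half-open `[lo, hi)` convention, the per-watchpoint access `kind`, and all names are assumptions made up for the sketch, not anything from a spec.]

```python
# Hedged sketch of range-based data watchpoint matching.
# A watchpoint is (lo, hi, kind) over the half-open range [lo, hi);
# an access is described by its address, size in bytes, and kind.

def watchpoint_hits(watchpoints, addr, size, access):
    """Return indices of watchpoints that the access [addr, addr+size) hits."""
    hits = []
    for i, (lo, hi, kind) in enumerate(watchpoints):
        overlaps = addr < hi and addr + size > lo  # any byte inside [lo, hi)
        if overlaps and kind in ("any", access):
            hits.append(i)
    return hits

wps = [(0x8000_0000, 0x8010_0000, "store"),  # e.g. a 1 MiB heap region
       (0x2000, 0x2004, "any")]              # e.g. one 4-byte variable
assert watchpoint_hits(wps, 0x8000_FFFE, 4, "store") == [0]
assert watchpoint_hits(wps, 0x2002, 1, "load") == [1]
```

A range watchpoint subsumes the single-location case (hi = lo + size), which is why a handful of ranges can be more useful than the same number of fixed locations.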
How useful would it be to store the value of each store as well?

data + byteenables, yes I forgot(!) that.

How large is large?

128K+ events, but obviously “large” is context dependent.

That's why I ask. :-)
Are you interested at all in a dedicated trace port, so a program can be traced without affecting its timing at all?

IMO, anything that requires specialized hardware will not get as much usage so *I* wouldn’t,
but you are of course correct that completely avoiding observation effects would require
dedicated resources, such as a dedicated trace buffer [per core].

Sorry for all the questions. I'm just trying to put together exactly what you want. I think I know what you want, but it's better to ask and be sure.

Thanks for asking.  I appreciate that these are just my opinions, but
as with everything RISC-V, this is an opportunity to do things better.

Your opinions are what I asked for.

Thank you,
Tim

Tommy Thorn

Nov 19, 2015, 8:36:03 PM
to Tim Newsome, hw-...@lists.riscv.org

> You can save a lot of space by only storing a single bit for every branch taken/not taken, instead of storing full PCs.

No, that fails for e.g. computed function calls.


> I believe as a general design principle we don't like to put things in the TLB that don't "belong" there. How many pages might you want to set this bit on?
> What about memory ranges, instead of locations?
> Ie. the breakpoint hits if the address accessed is between A and B. 4 sets of those seems pretty typical, although I'm not sure if it's enough to be useful. (The spec will probably allow for a largish number of them, if the implementer so chooses.)

That's understandable, but unfortunate. The only thing I can say is that the more general such a feature is,
the more cases it would be applicable to.

One of the hardest things I ever had to debug was a generational garbage collector (each session took days).
A feature such as this would have been enormously useful, but only if it could cover the entire heap. YMMV.

Regards,
Tommy


Tim Newsome

Nov 19, 2015, 8:46:19 PM
to Tommy Thorn, hw-...@lists.riscv.org
On Thu, Nov 19, 2015 at 5:36 PM, Tommy Thorn <tommy...@thorn.ws> wrote:

> You can save a lot of space by only storing a single bit for every branch taken/not taken, instead of storing full PCs.

No, that fails for e.g. computed function calls.

I didn't describe the entire scheme. Basically if by code inspection you can figure it out, use the single bit. Otherwise output the full PC. (Or better: you can output only the LSBs of the PC that actually changed.)

Scratch that.  It fails completely!  Hint: many branches can target the same address.  How do you know which one?

You output a bit when the branch is executed (not necessarily taken). So eg.

loop:
; body
bne r1, r2, loop
jr r3

Let's say the body of the loop is executed 3 times, and there's a breakpoint at the address r3 points to. The trace would be:
Trace start at <loop>
Branch taken
Branch taken
Branch not taken
Jump to <contents of r3>

Then you compress that down to short encodings. So 100<address> for trace started, 111 for branch taken, 110 for branch not taken, 101<contents> for Jump to. (I'm typing this up on the fly. Schemes like this exist, including Nexus trace and Intel's trace mentioned earlier in this thread.)
Now the debugger can look through the trace data and figure out what happened, because it knows what the code was. But without the code it can't do much.
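[Editorial illustration: the ad-hoc scheme Tim improvises above can be made concrete with a tiny encoder/decoder. This is a hedged Python sketch of exactly those packet codes (100<address>, 111 taken, 110 not taken, 101<target>); real formats such as Nexus trace or Intel PT are considerably more elaborate.]

```python
# Toy bit-level codec for the branch-trace packets described above.
# Fixed-width addresses are an assumption to keep the sketch simple.
ADDR_BITS = 32

def encode(events):
    """Encode trace events into a bit string."""
    bits = []
    for ev in events:
        if ev[0] == "start":        # 100 followed by the full start PC
            bits.append("100" + format(ev[1], f"0{ADDR_BITS}b"))
        elif ev[0] == "taken":      # 111: conditional branch taken
            bits.append("111")
        elif ev[0] == "not_taken":  # 110: conditional branch not taken
            bits.append("110")
        elif ev[0] == "jump":       # 101 followed by the indirect target
            bits.append("101" + format(ev[1], f"0{ADDR_BITS}b"))
    return "".join(bits)

def decode(bits):
    """Recover the events; with the program binary, a debugger can replay
    them to reconstruct the full control-flow path."""
    events, i = [], 0
    while i < len(bits):
        tag, i = bits[i:i + 3], i + 3
        if tag in ("100", "101"):
            kind = "start" if tag == "100" else "jump"
            events.append((kind, int(bits[i:i + ADDR_BITS], 2)))
            i += ADDR_BITS
        elif tag == "111":
            events.append(("taken",))
        elif tag == "110":
            events.append(("not_taken",))
    return events

# The loop example from the message: body runs three times, then jr r3.
trace = [("start", 0x1000), ("taken",), ("taken",), ("not_taken",),
         ("jump", 0x2000)]
assert decode(encode(trace)) == trace
```

Note the space win: 3 bits per conditional branch instead of a full PC per event, at the cost of needing the code image at decode time.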

> I believe as a general design principle we don't like to put things in the TLB that don't "belong" there. How many pages might you want to set this bit on?
> What about memory ranges, instead of locations?
> Ie. the breakpoint hits if the address accessed is between A and B. 4 sets of those seems pretty typical, although I'm not sure if it's enough to be useful. (The spec will probably allow for a largish number of them, if the implementer so chooses.)

That's understandable, but unfortunate.  The only thing I can say is that the more general such a feature is,
the more cases it would be applicable to.

One of the hardest things I ever had to debug was a generational garbage collector (each session took days).
A feature such as this would have been enormously useful, but only if it could cover the entire heap.  YMMV.

Was it contiguous in virtual address space at least? A single data hardware breakpoint could span the entire range. (Of course that doesn't help if some other process is modifying the heap using a different mapping.)

Tim
 

Regards,
Tommy


Stefan Wallentowitz

Nov 20, 2015, 2:54:27 AM
to hw-...@lists.riscv.org
Hi Tim,

On 19.11.2015 22:30, Tim Newsome wrote:
> I took a read through your slides. (I like this presentation format, but
> I'd like to see slide N/M displayed at the bottom.)

Yeah, I used reveal.js for the first time; it seems like an easy way
to do online slides with markdown and host them on GitHub. I'll see if
there is an N/M display option. Thanks.

> Using glip over JTAG seems a bit heavy-weight for a small processor.
> This might be fine for a full-featured CPU with several cores, but it
> seems a bit wasteful for a microcontroller. For instance, it requires
> sending 48 bits of data just to set up a data transfer, which is a huge
> amount of overhead since typical data transfers are small. (User prints
> out the value of a register, changes the value of a variable, etc.) On
> most 32-bit processors, 48 bits might be the largest register in the
> chip, just to accommodate this overhead! There are large data transfers,
> but even there you're limited by the size of a reasonable register on a
> chip. Nobody's going to put a 4KB shift register on the chip just to be
> able to efficiently download a program. (Maybe you can reuse on-chip
> cache for that, but I suspect just adding the hardware to turn it into a
> shift register is prohibitively expensive.)

It is more or less I/O driven. If you have no high-speed I/O you can use
JTAG, and that only brings you to decent speed for live tracing if you
use large shift registers. If you don't use a large shift register, your
throughput will be lower, of course. But that gives you what you also get
with standard JTAG debug interfaces, right?

But as I said, I see the JTAG option more as a generic fallback if you start
using Open SoC Debug.

> Your ideas of having one component trigger trace on another are more
> interesting to me.

This is indeed an interesting topic in research and already done in some
companies, like what someone else on the thread mentioned as online trace
processing. A good starting point if you are interested in this topic is
a nice presentation from a workshop we held
(http://www.mad-workshop.de/2013.html):
http://www.mad-workshop.de/slides/7_1.pdf

> Typo on the DEBUG PROCESSORS slides: microporgrams

Thanks for pointing that out. I will do a revision and also include plans to
add a logic-analyzer module type.

Best,
Stefan

Richard Herveille

Nov 20, 2015, 3:36:19 AM
to Tim Newsome, Richard Herveille, Tommy Thorn, hw-...@lists.riscv.org
So these are all cool and awesome ideas that make a lot of sense for big systems, but we are in the microcontroller segment … and when the debug system gets bigger than the actual CPU, something is off.

So how about creating a list of increasingly complex/more powerful debug facilities? Ranging from the bare minimum (i.e. single step), through slightly more complex (HW breakpoints/watchpoints), … to full per-thread traces.

I know NEXUS follows a similar approach, but NEXUS closed up: you need to pay to get even basic access to their specs, which contradicts why we opted for RISC-V (free). The 1999 release, though, is still available on the web and was released for free.

Also I would suggest splitting the debug controller from the actual interface. There's a strong case for having the same debug system work over JTAG, USB, Ethernet, ….
What definitely needs to be defined is the set of registers/features the CPU must support; then the behaviour, states, and features of the debug controller; then how it interfaces to multiple physical interfaces.

We are currently working on a JTAG interface, but there's no reason the debug controller couldn't communicate via USB. The reason we're currently focused on JTAG is that this interface is simply available on most (if not all) designs and platforms we work on, and adding additional pins or a different interface is not allowed by our customers.

Richard



ROA LOGIC
Design Services and Silicon Proven IP

Richard Herveille
Managing Director
Cell +31 (6) 5207 2230



Stefan Wallentowitz

Nov 20, 2015, 4:11:12 AM
to hw-...@lists.riscv.org
On 20.11.2015 09:36, Richard Herveille wrote:
> I know NEXUS follows a similar approach; but NEXUS closed up. You need
> to pay to get even basic access to their specs. Which contradicts why we
> chose to opt for the RISC-V (free). Although the 1999 release is still
> available on the web and was released for free.

I fully agree on Richard's NEXUS opinion and would like to add that even
the industry that created it doesn't use it widely. Together with it
being a closed standard I think there remains no good reason to use it.

> Also I would suggest to split the debug controller from the actual
> interface. There's strong case to have the same debug system work over
> JTAG, USB, Ethernet, ….

Exactly. I think the most important thing is that it should not be a
_JTAG_ debug controller, but a debug interface controller with a
simple MMIO-ish interface.

> What definitely needs to be defined is the set of registers/features the
> CPU must support. Then the behaviour, states, and features of the debug
> controller, then how it interfaces to multiple physical interfaces.

Andreas and I recently started something in this direction and are at
the very beginning. Basically we start with the adv_dbg_sys interface
and plan an adapter to a rather generic MMIO-interface plus an interrupt
mechanism to signal events.

Coming back to Tim's original question, I think the most important thing
I want from a JTAG debug is that it is developed together with other
projects in the community for better interoperability.

Best,
Stefan

Jamey Hicks

Nov 20, 2015, 9:23:56 AM
to Tim Newsome, Tommy Thorn, hw-...@lists.riscv.org
On Nov 19, 2015, at 3:32 PM, Tim Newsome <t...@sifive.com> wrote:

On Thu, Nov 19, 2015 at 11:59 AM, Jamey Hicks <jamey...@gmail.com> wrote:
I don't think I've ever used a JTAG-based debugger, but execution traces are extremely useful.

At Nokia, what we mostly used was timestamped process ID and procedure entry/exit so that we could measure system performance.  For example, looking at full stack performance including kernel, database, and application when flick-scrolling on the touchscreen so we could see where the time was going.

Here at CSAIL we are working on a RISC-V implementation in BSV and we're using online tracing of execution to verify correct execution of our implementation.

What exactly does "online" mean in this context?


By online, I mean we run the ISA simulator and the FPGA implementation in lockstep and compare the trace output without storing it anywhere.

As you point out, some tooling is required. The current state of the art for tracing on production hardware is extremely difficult to use.  Surely with an open ISA and open source software we can do better than that. I am completely agnostic about what the hardware interface is for configuration and control of debug, as long as it is a well documented interface for which open source tools are available.

Is there anything you're not agnostic about? (Eg. type of data collected during trace, size of trace buffer, ...)


It’s pretty clear the community needs a range of options for that, depending on the needs of the applications. Even on higher end embedded devices, I expect that there will be pressure to reduce pin counts, so we need a low pin count interface to the debug/trace functionality.

I’m sure that JTAG will be required for some applications, but I would prefer a packet oriented interface, so we can multiplex debug commands/responses, messages from the application (e.g., kernel console), and various types and levels of traces. For extremely small devices, the physical interface could be two or three wires, like SPI or I2C.

For higher performance devices, USB-3 seems like it might work. That is the direction we’re leaning if we get funding to build RISC-V prototype chips. Off-the-shelf debug/trace controllers. :)

Jamey

David Chisnall

unread,
Nov 20, 2015, 9:26:22 AM11/20/15
to Tim Newsome, hw-...@lists.riscv.org
On 19 Nov 2015, at 18:02, Tim Newsome <t...@sifive.com> wrote:
>
> so please let me know what you want the debug interface to do for you.

The BERI debug unit (which we talk to over JTAG) allows:

- Recording stream traces into a buffer, which can be dumped over JTAG (or PCIe), with each entry containing:
* Current PC
* Executed instruction
* Register value written (for stores to memory, the address)
* Wrapping cycle count value
* The current ASID
- Pausing and resuming the processor
- Setting breakpoints at specific PC values
- Injecting arbitrary instructions to execute
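For illustration only, an entry with those fields might be packed like this (the field widths here are hypothetical, not BERI's actual layout):

```python
# Hypothetical packing of a stream-trace entry with the fields listed
# above: PC, instruction bits, written value (or store address), a
# wrapping 16-bit cycle count, and the ASID. Widths are illustrative.
import struct

ENTRY = struct.Struct("<QIQHB")  # pc, instr, value, cycle, asid

def pack_entry(pc, instr, value, cycle, asid):
    # Mask the cycle count so it wraps rather than overflowing the field.
    return ENTRY.pack(pc, instr, value, cycle & 0xFFFF, asid)

def unpack_entry(buf):
    return ENTRY.unpack(buf)
```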

We currently don't have, but would find very useful (ARM and Intel equivalents have this), the ability for software to inject arbitrary data into the stream, for example changes to VM mappings and the current PID from the kernel.

I mostly use this for post-mortem debugging and have written an LLVM-based GUI tool that we use to inspect traces of a few tens of GBs. Our trace allows us to reconstruct all of the values in registers quite quickly (even short traces often have a context switch fairly early on, which gives us everything), but it would be nice if the start of a trace could have a complete dump of all register values. The format of our traces is parameterised in our trace analysis library (C++ template with a thing that knows how to transform the on-disk trace format into something in memory) and would likely be easy to adapt to RISC-V (assuming a slightly less immature LLVM implementation). I discussed this briefly with Yunsup a couple of weeks ago.

I've not used the online debugging features very often, but for post-mortem debugging of OS and compiler issues it's been invaluable to be able to easily explore large traces. I'm very glad we didn't have to try to bring up a software stack without this support.

David


Rishiyur Nikhil

unread,
Nov 20, 2015, 11:50:03 AM11/20/15
to David Chisnall, Tim Newsome, hw-dev
For us (at Bluespec), debugging starts with "Tandem Verification".
This means (as Jamey Hicks and David Chisnall described) checking
each instruction retirement against a golden model, catching divergences
immediately on the bad instruction.  This includes heuristics to
re-sync at non-deterministic points (RDCYCLE, timer interrupts,
external interrupts, reading uninitialized memory, ...).

Then, we continue debugging at the generic ISA level.  For this, we use
a direct remote-gdb interface, running the standard GDB RSP (Remote
Serial Protocol).  This is both in simulation and on hardware.  In
both simulation and hardware, the GDB RSP commands are served directly
in hardware (i.e., we implemented the 'gdbstub', which attaches to the
processor being debugged, directly in hardware).  We can do source-level
debugging, BREAK near the faulting instruction, single step, etc., which
gives us a clue as to what might have gone wrong.
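For reference, GDB RSP packets are framed as `$<payload>#<checksum>`, where the checksum is the payload bytes summed modulo 256; this framing is what a gdbstub, whether in software or hardware, has to speak. A minimal sketch:

```python
# Sketch of GDB Remote Serial Protocol packet framing: payload wrapped
# in '$'...'#' followed by a two-hex-digit modulo-256 checksum.
def rsp_encode(payload: str) -> str:
    checksum = sum(payload.encode()) % 256
    return f"${payload}#{checksum:02x}"

def rsp_decode(packet: str) -> str:
    assert packet[0] == "$" and packet[-3] == "#", "malformed packet"
    payload, checksum = packet[1:-3], int(packet[-2:], 16)
    assert sum(payload.encode()) % 256 == checksum, "bad checksum"
    return payload
```

For example, the `g` (read all registers) request goes on the wire as `$g#67`. The simplicity of this framing is part of why it can be served directly in hardware.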

Finally, we drop down to the micro-arch level for more detail.  For
this, we use $displays and VCD waveforms in simulation, and we use VCD
waveforms in hardware (we have infrastructure to record and dump
waveforms from FPGA to a host), with no limit since we can stall
the hardware clock to keep up with the dump stream.

In general, we don't use JTAG directly, although we could use it as a
transport from some of the above.

Nikhil

Tim Newsome

unread,
Nov 20, 2015, 12:46:42 PM11/20/15
to Ray VanDeWalker, hw-...@lists.riscv.org
Do you know where I might be able to read that standard? I haven't even been able to find a place where I could buy it.
I'm all for implementing existing standards if they suit our needs.

Tim

Tim Newsome

unread,
Nov 20, 2015, 1:31:03 PM11/20/15
to Richard Herveille, Tommy Thorn, hw-...@lists.riscv.org
On Fri, Nov 20, 2015 at 12:36 AM, Richard Herveille <richard....@roalogic.com> wrote:
So these are all cool and awesome ideas and make a lot of sense for big systems, but we are in the microcontroller segment … and when the debug system gets bigger than the actual CPU, something is off.

So how about creating a list of increasingly complex/more powerful debug facilities? Ranging from the bare minimum (i.e. single step), to slightly more complex (HW breakpoints/watchpoints), … to full per-thread traces.

Agreed. I'm doing my best to keep the debug interface scalable from a bare minimum (halt/run/step, read/write registers/memory) to more full featured, just like the ISA itself.

I know NEXUS follows a similar approach; but NEXUS closed up. You need to pay to get even basic access to their specs, which contradicts why we opted for RISC-V (free). Although the 1999 release is still available on the web and was released for free.

Thanks. I finally found something to read with the hint about 1999.

Also I would suggest splitting the debug controller from the actual interface. There's a strong case for having the same debug system work over JTAG, USB, Ethernet, ….
What definitely needs to be defined is the set of registers/features the CPU must support. Then the behaviour, states, and features of the debug controller, then how it interfaces to multiple physical interfaces.

We are currently working on a JTAG interface, but there's no reason the debug controller couldn't communicate via USB. The reason we're currently focused on JTAG is that the interface is simply available on most (if not all) designs and platforms we work on, and adding additional pins or a different interface is not allowed by our customers.

How do you envision USB/Ethernet debugging working? Would you have a USB/Ethernet controller on-chip which can optionally become a debug controller? That sounds doable assuming you can make those services work while the core is halted. Since JTAG as a transport just provides access to a few new registers, that should be easy to accommodate in USB/Ethernet as well.

It's worth thinking about JTAG first, since it's the simplest (so, as you point out, likely to exist in most places) and slowest (so needing the most optimization) interface of them all.

Tim

Tim Newsome

unread,
Nov 20, 2015, 1:54:30 PM11/20/15
to Stefan Wallentowitz, hw-...@lists.riscv.org
On Fri, Nov 20, 2015 at 1:11 AM, Stefan Wallentowitz <stefan.wa...@tum.de> wrote:
On 20.11.2015 09:36, Richard Herveille wrote:
> I know NEXUS follows a similar approach; but NEXUS closed up. You need
> to pay to get even basic access to their specs. Which contradicts why we
> chose to opt for the RISC-V (free). Although the 1999 release is still
> available on the web and was released for free.

I fully agree on Richard's NEXUS opinion and would like to add that even
the industry that created it doesn't use it widely. Together with it
being a closed standard I think there remains no good reason to use it.

> Also I would suggest to split the debug controller from the actual
> interface. There’s strong case to have the same debug system work over
> JTAG, USB, Ethernet, ….

Exactly, I think the most important thing is that it should not be a
_JTAG_ debug controller, but a debug interface controller with a
simple MMIO-ish interface.

I disagree, for the 2 reasons I mentioned in my e-mail to Richard just now:
1. JTAG is the simplest interface, so it will see the largest adoption. (Even microcontrollers might implement it.)
2. JTAG is the slowest interface, so it benefits the most from any optimization.

Having said that, JTAG isn't so weird that it's hard to use a different transport in its place. I'll see if I can at least add a hand-wavy section to the spec which suggests how to use a different transport.

> What definitely needs to be defined is the set of registers/features the
> CPU must support. Then the behaviour, states, and features of the debug
> controller, then how it interfaces to multiple physical interfaces.

Andreas and I recently started something in this direction and are at
the very beginning. Basically we start with the adv_dbg_sys interface
and plan an adapter to a rather generic MMIO-interface plus an interrupt
mechanism to signal events.

Coming back to Tim's original question, I think the most important thing
I want from JTAG debug is that it is developed together with other
projects in the community for better interoperability.

Can you point me at any projects, preferably ones that have something at least half working?

Tim
 

Best,
Stefan

Tim Newsome

unread,
Nov 20, 2015, 2:25:11 PM11/20/15
to Jamey Hicks, Tommy Thorn, hw-...@lists.riscv.org
On Fri, Nov 20, 2015 at 6:23 AM, Jamey Hicks <jamey...@gmail.com> wrote:

On Nov 19, 2015, at 3:32 PM, Tim Newsome <t...@sifive.com> wrote:

On Thu, Nov 19, 2015 at 11:59 AM, Jamey Hicks <jamey...@gmail.com> wrote:
I don't think I've ever used a JTAG-based debugger, but execution traces are extremely useful.

At Nokia, what we mostly used was timestamped process ID and procedure entry/exit so that we could measure system performance.  For example, looking at full stack performance including kernel, database, and application when flick-scrolling on the touchscreen so we could see where the time was going.

Here at CSAIL we are working on a RISC-V implementation in BSV, and we're using online tracing to verify correct execution of our implementation.

What exactly does "online" mean in this context?

By online, I mean we run the ISA simulator and the FPGA implementation in lockstep and compare the trace output without storing it anywhere.

That's interesting. I mostly think of JTAG debug as a tool for software developers to find bugs in their code, and not as a tool for hardware devs to find bugs in their implementation. With the latter in mind, is there anything that would be particularly useful to expose?
As you point out, some tooling is required. The current state of the art for tracing on production hardware is extremely difficult to use.  Surely with an open ISA and open source software we can do better than that. I am completely agnostic about what the hardware interface is for configuration and control of debug, as long as it is a well documented interface for which open source tools are available.

Is there anything you're not agnostic about? (Eg. type of data collected during trace, size of trace buffer, ...)


It’s pretty clear the community needs a range of options for that, depending on the needs of the applications. Even on higher end embedded devices, I expect that there will be pressure to reduce pin counts, so we need a low pin count interface to the debug/trace functionality.

I’m sure that JTAG will be required for some applications, but I would prefer a packet oriented interface, so we can multiplex debug commands/responses, messages from the application (e.g., kernel console), and various types and levels of traces. For extremely small devices, the physical interface could be two or three wires, like SPI or I2C.

For higher performance devices, USB-3 seems like it might work. That is the direction we’re leaning if we get funding to build RISC-V prototype chips. Off-the-shelf debug/trace controllers. :)

That does sound pretty neat.

Tim 

Tim Newsome

unread,
Nov 20, 2015, 2:38:03 PM11/20/15
to David Chisnall, hw-...@lists.riscv.org
On Fri, Nov 20, 2015 at 6:26 AM, David Chisnall <David.C...@cl.cam.ac.uk> wrote:
On 19 Nov 2015, at 18:02, Tim Newsome <t...@sifive.com> wrote:
>
> so please let me know what you want the debug interface to do for you.

The BERI debug unit (which we talk to over JTAG) allows:

- Recording stream traces into a buffer, which can be dumped over JTAG (or PCIe), with each entry containing:
        * Current PC
        * Executed instruction
        * Register value written (for stores to memory, the address)
        * Wrapping cycle count value
        * The current ASID
- Pausing and resuming the processor
- Setting breakpoints at specific PC values
- Injecting arbitrary instructions to execute

We currently don't have, but would find very useful (ARM and Intel equivalents have this), the ability for software to inject arbitrary data into the stream, for example changes to VM mappings and the current PID from the kernel.

You mean while the core is halted, or while it's running?

I mostly use this for post-mortem debugging and have written an LLVM-based GUI tool that we use to inspect traces of a few tens of GBs. 

What are you using that lets you store tens of GBs of trace data?
 
Our trace allows us to reconstruct all of the values in registers quite quickly (even short traces often have a context switch fairly early on, which gives us everything), but it would be nice if the start of a trace could have a complete dump of all register values.  The format of our traces is parameterised in our trace analysis library (C++ template with a thing that knows how to transform the on-disk trace format into something in memory) and would likely be easy to adapt to RISC-V (assuming a slightly less immature LLVM implementation).  I discussed this briefly with Yunsup a couple of weeks ago.

I’ve not used the online debugging features very often, but for post-mortem debugging of OS and compiler issues it’s been invaluable to be able to easily explore large traces.  I’m very glad we didn’t have to try to bring up a software stack without this support.

Thanks for sharing.

Tim
 

David


Tim Newsome

unread,
Nov 20, 2015, 2:47:07 PM11/20/15
to Rishiyur Nikhil, David Chisnall, hw-dev
Thanks for an overview of your workflow. I suspect that JTAG isn't great for that because its bandwidth is so limited, but I'll keep it in mind.

Tim

Stefan Wallentowitz

unread,
Nov 20, 2015, 2:55:54 PM11/20/15
to Tim Newsome, hw-...@lists.riscv.org
On 20.11.2015 19:54, Tim Newsome wrote:
> Exactly, I think this is the most important that it should not be a
> _JTAG_ debug controller, but the debug interface controller with a
> simple MMIO-ish interface.
>
>
> I disagree, for the 2 reasons I mentioned in my e-mail to Richard just now:
> 1. JTAG is the simplest interface, so it will see the largest adoption.
> (Even microcontrollers might implement it.)
> 2. JTAG is the slowest interface, so it benefits the most from any
> optimization.

My statement was not against JTAG. What others and I are saying is that
a debug interface definition is needed, with the control/access scheme
defined (MMIO), and that the question of how to access registers is
independent of this (be it JTAG, SWP, Aurora, ...). I think your question
was mainly in the direction of the first and second, right?

>
> Can you point me at any projects, preferably ones that have something at
> least half working?

There was a discussion between Richard and Andreas a week ago about
adv_debug_sys-based interfaces. This is a good starting point for a
discussion, I think.

Best,
Stefan

Richard Herveille

unread,
Nov 20, 2015, 3:33:48 PM11/20/15
to Tim Newsome, Richard Herveille, Tommy Thorn, hw-...@lists.riscv.org


How do you envision USB/Ethernet debugging working? Would you have a USB/Ethernet controller on-chip which can optionally become a debug controller? That sounds doable assuming you can make those services work while the core is halted. Since JTAG as a transport just provides access to a few new registers, that should be easy to accommodate in USB/Ethernet as well.

JTAG in its essence is a simple serial protocol that requires a few dedicated registers. That can easily be done via USB, even without having a CPU working I think.
Ethernet might be a bit more problematic to do without a CPU I guess, especially since it would be really neat if the communication is over TCP or UDP. But again all you need to do is mimic the JTAG regs.
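As a sketch of that idea: the debug controller only needs register read/write primitives, so any transport that provides them will do (the interface, register address, and class names below are hypothetical):

```python
# Sketch of a transport-agnostic debug interface: the controller only
# needs register read/write primitives, so JTAG, USB, or a TCP socket
# can all serve as the transport. Names and addresses are hypothetical.
from abc import ABC, abstractmethod

class DebugTransport(ABC):
    @abstractmethod
    def read_reg(self, addr: int) -> int: ...
    @abstractmethod
    def write_reg(self, addr: int, value: int) -> None: ...

class LoopbackTransport(DebugTransport):
    """Stand-in transport backed by a dict, for exercising the model."""
    def __init__(self):
        self.regs = {}
    def read_reg(self, addr):
        return self.regs.get(addr, 0)
    def write_reg(self, addr, value):
        self.regs[addr] = value

HALT_REG = 0x10  # hypothetical debug register address

def halt_core(t: DebugTransport):
    # Higher-level operations are built only on the two primitives,
    # so they are independent of the physical transport.
    t.write_reg(HALT_REG, 1)
```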

RIchard

Richard Herveille

unread,
Nov 20, 2015, 3:35:28 PM11/20/15
to Tim Newsome, Richard Herveille, Stefan Wallentowitz, hw-...@lists.riscv.org

Can you point me at any projects, preferably ones that have something at least half working?



It’s more than half working, though I haven’t hooked it up to our CPU yet.
The current plan is to have the CPU + JTAG Debug working on an FPGA in December.

Richard






Tim
 

Best,
Stefan

Tim Newsome

unread,
Nov 20, 2015, 4:51:22 PM11/20/15
to Stefan Wallentowitz, hw-dev
On Fri, Nov 20, 2015 at 11:55 AM, Stefan Wallentowitz <stefan.wa...@tum.de> wrote:
On 20.11.2015 19:54, Tim Newsome wrote:
>     Exactly, I think this is the most important that it should not be a
>     _JTAG_ debug controller, but the debug interface controller with a
>     simple MMIO-ish interface.
>
>
> I disagree, for the 2 reasons I mentioned in my e-mail to Richard just now:
> 1. JTAG is the simplest interface, so it will see the largest adoption.
> (Even microcontrollers might implement it.)
> 2. JTAG is the slowest interface, so it benefits the most from any
> optimization.

My statement was not against JTAG. What others and I are saying is that
a debug interface definition is needed, with the control/access scheme
defined (MMIO), and that the question of how to access registers is
independent of this (be it JTAG, SWP, Aurora, ..).

I mostly agree with that. The only thing I'd add is that because JTAG is common and slow, it makes sense to optimize the control/access scheme with it in mind. For instance, in JTAG writing a 4-bit register is faster than writing a 16-bit one, so it's worth keeping frequently-used registers small.
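The cost difference is easy to model: a JTAG data-register scan shifts one bit per TCK cycle plus a fixed TAP state-machine overhead, so access time scales directly with register width. A back-of-the-envelope sketch (the overhead cycle count is an assumption and varies by controller):

```python
# Back-of-the-envelope cost of a JTAG data-register scan: one TCK per
# shifted bit plus fixed TAP state-machine overhead. The exact overhead
# depends on the controller; 6 cycles here is an assumed value for the
# Select-DR..Update-DR transitions.
TAP_OVERHEAD_CYCLES = 6

def dr_scan_cycles(reg_bits: int, overhead: int = TAP_OVERHEAD_CYCLES) -> int:
    return reg_bits + overhead
```

Under this model a 4-bit register access costs 10 cycles against 22 for a 16-bit one, which is why it pays to keep frequently-used registers small.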
 
I think your question was in
the direction of the first and second mainly, right?

That's right.

>
> Can you point me at any projects, preferably ones that have something at
> least half working?

There was a discussion between Richard and Andreas a week ago about
adv_debug_sys-based interfaces. This is a good starting point for a
discussion, I think.

There were some interesting bits in that thread. Thanks.

Tim
 

Best,
Stefan


Tim Newsome

unread,
Nov 20, 2015, 4:53:48 PM11/20/15
to Richard Herveille, Tommy Thorn, hw-dev
On Fri, Nov 20, 2015 at 12:33 PM, Richard Herveille <richard....@roalogic.com> wrote:


How do you envision USB/Ethernet debugging working? Would you have a USB/Ethernet controller on-chip which can optionally become a debug controller? That sounds doable assuming you can make those services work while the core is halted. Since JTAG as a transport just provides access to a few new registers, that should be easy to accommodate in USB/Ethernet as well.

JTAG in its essence is a simple serial protocol that requires a few dedicated registers. That can easily be done via USB, even without having a CPU working I think.
Ethernet might be a bit more problematic to do without a CPU I guess, especially since it would be really neat if the communication is over TCP or UDP. But again all you need to do is mimic the JTAG regs.

I completely agree.

Tim

David Chisnall

unread,
Nov 21, 2015, 4:29:00 AM11/21/15
to Tim Newsome, hw-...@lists.riscv.org
On 20 Nov 2015, at 19:38, Tim Newsome <t...@sifive.com> wrote:
>
> You mean while the core is halted, or while it's running?

While the core is paused. This is useful for dumping other information (e.g. doing a load at a particular address with $0 as the destination will put the value at that memory address into the stream trace).

>
>> I mostly use this for post-mortem debugging and have written an LLVM-based GUI tool that we use to inspect traces of a few tens of GBs.
>
> What are you using that lets you store tens of GBs of trace data?

Hard disks. There's a small RAM buffer that we dump via JTAG over USB or PCIe.

David


Madhu (Macaque Labs)

unread,
Nov 21, 2015, 6:23:56 AM11/21/15
to David Chisnall, Tim Newsome, Hw-dev
In general, USB, Ethernet, or PCIe are a pain for smaller cores.
They increase complexity and size, and all three need PHY support.

The best compromise is a SPI variant. Quad SPI or the new octal
SPI is much simpler to implement. Octal does need DDR,
and I think standard quad SPI would suffice for most cases.
Since we plan to release a quad SPI as part of our open source IP
portfolio, the community can avoid having to license anything.

If, as proposed, we have a standard transport-agnostic interface, we can use JTAG or a variant like this
for the physical transport.

We need to finalize this pretty soon. We have a bunch of real-world SoCs coming up,
and probably a start-up that may use our cores, so we would like to standardize on something
sooner rather than later.

--
Regards,
Madhu

Tim Newsome

unread,
Nov 23, 2015, 1:23:25 PM11/23/15
to David Chisnall, hw-dev
Neat. I didn't realize a solution that slow would be interesting to people.

Tim
 

David


Tim Newsome

unread,
Nov 23, 2015, 1:26:04 PM11/23/15
to Madhu (Macaque Labs), David Chisnall, Hw-dev
On Sat, Nov 21, 2015 at 3:23 AM, Madhu (Macaque Labs) <ma...@macaque.in> wrote:
In general, USB, Eth or PCIe are a  pain for smaller cores.
Increases complexity and size and all three need PHY support.

The best compromise is a SPI variant. Quad SPI or the new Octal
SPI is much simpler to implement. The octal does need DDR
and I think std quad SPI would suffice for most cases.
Since we plan to release a quad SPI as part of our open source IP
portfolio, the community can avoid having to license anything.

If, as proposed, we have a standard transport-agnostic interface, we can use JTAG or a variant like this
for the physical transport.

That's what I'll end up doing.

We need to finalize this pretty soon. We have a bunch of real world SoCs coming up
and probably a start-up that may use our cores, so would like to standardize on something
sooner than later.

I'm hoping to have a spec close-to-final in January. There may be tweaks still at that time but I doubt they would affect major functionality.

Tim