multi-kernel support with reconos

Guanwen (Henry) Zhong

Feb 27, 2017, 10:15:34 PM

to ReconOS

Hi Christoph,

Does ReconOS provide a multi-kernel version? For example, kernel 1 could have 2 hw slots, while kernel 2 could have 4 hw slots (with no communication between the two sets of hw slots). Currently, I am working on the "develop" branch and haven't looked at the "develop_ic" branch in detail. It looks like the "develop_ic" branch has these features. Is that correct?

Thanks,

Guanwen

Christoph Rüthing

Feb 28, 2017, 2:38:43 PM

to rec...@googlegroups.com

Hi Guanwen,

I am not sure I have understood your question correctly, or what you are referring to with "kernel". Basically, you just have the abstraction of hardware threads and can create them arbitrarily as you need them. So it is no problem to create two hardware threads for one task and four others for a different task. You can use a separate set of resources for each set of hardware threads and restrict communication that way. ReconOS itself does not force a specific layout on you; you can design your application however you like. Does that answer your question?

The develop_ic branch does not introduce a new concept there; it just implements some further communication mechanisms that allow hardware threads to communicate directly in hardware without involving the processor (as was done before using the delegate threads).

Yours,
Christoph

Guanwen (Henry) Zhong

Feb 28, 2017, 10:43:13 PM

to ReconOS, chri...@muppetnet.net

> Does that answer your question?

Yes. Thanks Christoph for the information. :-)

The "kernel" I mentioned is your "task" above.

Initially, I tried to build multiple tasks with rdk by changing "build.cfg" (adding more sections: "[HwSlot@task1], [HwSlot@task2], ..."). But rdk failed to generate edk for multiple tasks, so I am modifying rdk to support this.

As you said, "You can use a separate set of resources for each set of hardware threads and restrict communication that way."

Does that mean that in "build.cfg" I need to create multiple "ResourceGroup" sections? For example:

----------
...
[HwSlot@task1(0:1)]
...
[ResourceGroup@Resources1]
...
[HwSlot@task2(0:3)]
...
[ResourceGroup@Resources2]
...
----------

I still have a question about the memory controller, "reconos_memif_memory_controller_v1_00_a". I set up ReconOS with multiple (let's say "N") hw slots (single task) and ran it successfully on a ZedBoard. However, as those "N" hw slots share the memory controller through the "memif_arbiter", data transfer becomes the bottleneck: the speedup saturates at some value no matter how many hw slots we add, especially when the computation in each hw slot is fast compared to the communication. Have you faced this problem before with multiple hw slots? I am trying to update the memory controller so that each task can have its own memory controller.
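To make the saturation effect concrete, here is a minimal back-of-the-envelope model in Python. It is only a sketch under simplifying assumptions that are not measured on ReconOS: every slot alternates between computing on a block and transferring it, and the arbiter fully serializes transfers from different slots. The function name and the timing numbers are hypothetical.

```python
def speedup(n_slots: int, t_compute: float, t_transfer: float) -> float:
    """Speedup over a single slot when all slots share one memory port.

    Assumptions (illustrative only): each slot spends t_compute on a block
    and t_transfer moving it, and transfers from different slots never
    overlap because they go through one arbiter and one controller.
    """
    # Blocks per time unit if no slot ever had to wait for the port.
    unconstrained = n_slots / (t_compute + t_transfer)
    # The shared port can move at most one block per t_transfer.
    port_limit = 1.0 / t_transfer
    # Throughput of a single slot, used as the baseline.
    single_slot = 1.0 / (t_compute + t_transfer)
    return min(unconstrained, port_limit) / single_slot


if __name__ == "__main__":
    # Hypothetical numbers: compute takes 3x as long as the transfer.
    for n in (1, 2, 4, 8, 16):
        print(n, speedup(n, t_compute=3.0, t_transfer=1.0))
```

In this model the speedup grows linearly until it reaches (t_compute + t_transfer) / t_transfer and then stays flat, so the faster the computation in each slot, the earlier the curve saturates.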

Thanks,

Guanwen

chri...@muppetnet.net

Mar 2, 2017, 4:32:04 PM

to ReconOS

Hi Guanwen,

yes, I meant creating multiple slots and hardware threads in the build.cfg. However, I am a little surprised that it does not work directly, since I know that others have already used multiple slots/threads. Currently the implementation of RDK requires setting the IDs of slots correctly; maybe that was wrong in your config? Nevertheless, it is fine if it works for you. I can try it out myself in the next few days.
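As a rough illustration of the slot ID point, the following Python sketch scans a build.cfg-style text for overlapping slot IDs. It rests on assumptions the thread does not confirm: that a header like "[HwSlot@task1(0:1)]" declares two slots, and that an "Id = k" line inside the section gives the ID of the first one. The helper names are hypothetical and this is not part of RDK.

```python
import re

# Matches section headers such as "[HwSlot@task1(0:1)]".
HEADER = re.compile(r"^\[HwSlot@(\w+)\((\d+):(\d+)\)\]$")


def slot_ids(cfg_text):
    """Map each HwSlot name to the list of slot IDs it would occupy.

    Assumption: "(a:b)" declares b - a + 1 slots and "Id = k" is the ID of
    the first of them. Sections without an "Id" line map to an empty list.
    """
    result = {}
    current = None  # (name, slot count) of the HwSlot section being read
    for raw in cfg_text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            name = match.group(1)
            count = int(match.group(3)) - int(match.group(2)) + 1
            current = (name, count)
            result[name] = []
        elif line.startswith("["):
            current = None  # some other section, e.g. a ResourceGroup
        elif current and line.replace(" ", "").startswith("Id="):
            first = int(line.split("=", 1)[1])
            result[current[0]] = list(range(first, first + current[1]))
    return result


def collisions(ids):
    """Return (slot_id, first_owner, second_owner) for every reused ID."""
    seen = {}
    clashes = []
    for name, id_list in ids.items():
        for slot_id in id_list:
            if slot_id in seen:
                clashes.append((slot_id, seen[slot_id], name))
            else:
                seen[slot_id] = name
    return clashes


if __name__ == "__main__":
    example = "[HwSlot@task1(0:1)]\nId = 0\n[HwSlot@task2(0:3)]\nId = 0\n"
    print(collisions(slot_ids(example)))
```

Under these assumptions, two slot groups that both start at the same Id would be reported as clashing, while starting the second group after the last ID of the first would not.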

Regarding the resources you are also right. It should work by simply creating multiple groups and assigning them to the hardware threads.

Your question about the memory interface is a little more complicated. Currently it is just meant to be a single interface for all hardware threads. Creating multiple interfaces, one per hardware thread, would work, but you still need to arbitrate for the actual memory access (there is only a single interface through the ACP; otherwise you need to use the non-cache-coherent AXI ports). Maybe you can do this arbitration at a higher clock frequency.

However, in general we often faced the problem of communication overhead. Not necessarily due to the memory interface but mainly due to the communication via the processor. That’s why we experimented in the develop_ic branch with direct communication between hardware threads without involving the processor. Of course, this does not solve your problem of having a bottleneck in the memory access. In general, tasks which have a high communication overhead might not be perfectly suited to acceleration on the FPGA; longer-running and parallelizable algorithms are more interesting.

Yours,

Christoph

Guanwen (Henry) Zhong

Mar 2, 2017, 11:16:50 PM

to ReconOS, chri...@muppetnet.net

Thanks Christoph.

> Currently the implementation of RDK requires setting the IDs of slots correctly; maybe that was wrong in your config?

If it works for you, then the problem is in my build.cfg. Never mind, I can fix this. Currently, I know how to do this multi-kernel thing by manually changing the ReconOS API.

> Currently it is just meant to be a single interface for all hardware threads. Creating multiple interfaces, one per hardware thread, would work, but you still need to arbitrate for the actual memory access (there is only a single interface through the ACP; otherwise you need to use the non-cache-coherent AXI ports).

Yes, you are right. Multiple slots share the memory controller and the MMU. Those slots communicate with the memory components through the arbiter unit. I am going to duplicate the memory controller and arbiter, but reuse the MMU as it contains the TLB.

> However, in general we often faced the problem of communication overhead. Not necessarily due to the memory interface but mainly due to the communication via the processor.

In my current design, I make the hw slot keep sending memory requests. Thus, there is not much communication between the processor and the hw in my case (different from the way the sort demo sends and receives data). The bottleneck in my design so far comes from the shared memory controller. Actually, it also happens in the sort demo. The reason we cannot see this problem in the sort demo is that its computation part is not optimized at all, so its computation time is much longer than its communication time. If we keep adding optimizations (e.g. array partitioning, pipelining with HLS) to the compute part of the sort demo, then communication will dominate the performance. Therefore, I am planning to add a memory controller to each task (kernel) with multiple slots. It looks like it is not difficult to do that. :-)