[SThorm] Communication failure: ENOSPC

Patrick Bellasi

Mar 11, 2013, 1:39:12 PM

to Germain Haugou, bosp-...@googlegroups.com

Hi Germain!

We are proceeding with BBQ-SThorm integration tests.

Right now we have these workloads available:

1. a MV version running standalone

2. a MV version integrated with BBQ

3. an "experimental" version of MV integrated with BBQ.

The first two versions are running properly, while the latter has a problem which Edoardo will present in a follow-up mail.

With this email I'm focusing just on the second version, to report an issue we are currently facing and which I guess could be related to the BBQ SThorm PIL.

Fortunately we are "lucky", since the problem is reproducible: if we run the MV application four times, the last execution fails, with BBQ stating that there was an (unknown) error on sending a notification to the platform. If I restart BBQ (not the board), then I can again run the application three times, and on the fourth run we get the same communication error.

The interesting point is that in the kernel log I get this message:

/work1/p12_comp/jenkins/workspace/SDK_base_ubuntu/src/linux-driver/kernel_module/src/p2012_ld.c, usr_ioctl(2068) : ERROR : return -ENOSPC

which, according to some grep on the SDK, reads as "No space left on device"... thus my guess: are we using up all the HOST-SThorm shared memory?

Thus I suspect that either we are not using the p2012 communication API properly, or something is not working properly internally.

Questions:

1. could this problem be related to the HOST-SThorm shared memory managed by the p2012 device driver?

2. should we somehow manage this shared memory from the BBQ side, e.g. release message passing buffers?!?

I've noticed that we were not using p2012_getNextMessage, thus I suspected that the input queue could fill up and cause the communication error (even if that does not seem completely reasonable to me).

Thus, I've added event loop processing to the PIL, where I continuously check for new messages from the input queue.

This added two more problems:

1. I was expecting the getNextMessage method to be a blocking call, instead it seems to return immediately, with

p2012_ld.c, p2012_ld_read(995) : ERROR (msgSize > count): return -EINVAL

(at least) when no messages are available in input

=> I think we should add support for the poll syscall to the p2012 device driver

2. I'm still getting the ENOSPC error after a few runs of the application.
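Regarding point 1, the blocking behaviour I was expecting can be sketched in user space as below. This is only a minimal sketch in C, assuming the p2012 device node would report readiness through poll(2), which it apparently does not do today; `wait_readable` and `receive_message` are hypothetical helper names, not part of the p2012 API, and any pollable file descriptor (a pipe, for instance) can stand in for the device while testing.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait until fd has data to read. Returns 1 if readable, 0 on timeout,
 * or a negative errno value on failure. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);   /* retry if a signal interrupted us */

    if (rc < 0)
        return -errno;
    if (rc == 0)
        return 0;                         /* timed out, nothing to read */
    return (pfd.revents & POLLIN) ? 1 : -EIO;
}

/* Blocking-style receive built on top of a read() that would otherwise
 * return immediately: returns the number of bytes read, 0 on timeout,
 * or a negative errno value. */
ssize_t receive_message(int fd, void *buf, size_t len, int timeout_ms)
{
    int ready = wait_readable(fd, timeout_ms);
    ssize_t n;

    if (ready <= 0)
        return ready;
    n = read(fd, buf, len);
    return n < 0 ? -errno : n;
}
```

With something like this in the PIL event loop, the loop would sleep inside poll() until a message arrives instead of spinning on a read that fails with -EINVAL.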

I would like to investigate these issues further, but unfortunately I don't have the device driver source code, which would be quite useful here.

I propose to exploit the recently added GIT support to the Codendi website to share the device driver source code.

Ok, that's almost all for the time being. Any input from your side is more than welcome... meanwhile I'll go on with some more debugging/experiments.

Ciao Patrick

--
#include <best/regards.h>

Patrick Bellasi
Post-Doc at Politecnico di Milano
http://home.dei.polimi.it/bellasi

Germain HAUGOU

Mar 11, 2013, 4:51:20 PM

to Patrick Bellasi, bosp-...@googlegroups.com

Hi Patrick,

indeed I think the message queue should become full after several application runs. For now the OpenCL runtime is not using the messages because it is not migrating tasks during execution. What is missing is a mechanism to skip the messages in case they are not needed. I will add that soon. For now, can you deactivate the posting of those messages?

Ciao,

Germain

Patrick Bellasi

Mar 12, 2013, 7:09:36 AM

to Germain HAUGOU, bosp-...@googlegroups.com

On Mon, Mar 11, 2013 at 9:51 PM, Germain HAUGOU <germain...@st.com> wrote:

Hi Patrick,

Hi Germain!

indeed I think the message queue should become full after several application runs. For now the OpenCL runtime is not using the messages because it is not migrating tasks during execution.

We are also not migrating tasks right now. But simply running the same application _four_ times (i.e. not several), one after the other, produces that issue. We schedule a single application each time, and that application is neither migrated nor reconfigured.

Actually, the issue is in p2012_sendMessage: this is the call that returns ENOSPC.

This is the driver log when BBQ creates the queues:

[ 3674.920000] p2012_daemon_ioctl(1589) : case P2012_P2012D_OPEN
[ 3674.950000] p2012_daemon_ioctl(1606) : p2012Priv = d85bac00
[ 3674.990000] p2012_allocSysQueues PrivateHandle(1096) : p2012Priv = d85bac00
[ 3675.020000] p2012_daemon_ioctl(1299) : case P2012_CREATE_QUEUE from(5) to (4)
[ 3675.060000] p2012_daemon_ioctl(1335) : data.back_link= (null)
[ 3675.090000] p2012_setProcess(485) : backlink= (null)
[ 3675.120000] p2012_daemon_ioctl(1299) : case P2012_CREATE_QUEUE from(4) to (5)
[ 3675.160000] p2012_daemon_ioctl(1335) : data.back_link= (null)
[ 3675.190000] p2012_setProcess(485) : backlink= (null) _SEND_MSG
[ 3675.240000] p2012_daemon_ioctl(1435) : case P2012_BBQ_INIT
[ 3675.270000] p2012_daemon_ioctl(1440) : case P2012_BBQ_INIT handle = 70500000
[ 3675.310000] p2012_daemon_ioctl(1441) : case P2012_BBQ_INIT size = 524288, 0x80000
[ 3675.340000] p2012_ld_mmap(2285) : vm_pgoff=460032=0x70500
[ 3675.370000] p2012_ld_mmap(2286) : offset=1884291072=0x70500000
[ 3675.400000] p2012_ld_mmap(2287) : vma_size=524288=0x80000
[ 3675.430000] p2012_ld_mmap(2319) : curr->busAddr=0x70500000
[ 3675.460000] p2012_ld_mmap(2320) : curr->userLogicalAddr=0x (null)
[ 3675.500000] p2012_ld_mmap(2322) : curr->busAddr=0x70500000
[ 3675.530000] p2012_ld_mmap(2323) : curr->userLogicalAddr=0x (null)
[ 3675.560000] p2012_ld_mmap(2373) : userLogicalAddr=b66f2000
[ 3675.590000] p2012_ld_mmap(2374) : busAddr=0x70500000
[ 3675.620000] p2012_ld_mmap(2375) : kernelLogicalAddr=e1500000
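
As a sanity check on the mmap numbers in this log, the arithmetic can be reproduced with a few lines of C. This is just a sketch, assuming 4 KiB pages (which is what the logged vm_pgoff/offset pair implies); the helper names are made up for illustration and are not driver functions.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages (page shift of 12), as implied by the logged
 * vm_pgoff and offset values; the driver source is not available here. */
#define ASSUMED_PAGE_SHIFT 12u

/* Convert an mmap page offset (vm_pgoff) into a byte offset. */
uint64_t pgoff_to_bytes(uint64_t pgoff)
{
    return pgoff << ASSUMED_PAGE_SHIFT;
}

/* Number of pages needed to cover a region of the given size in bytes. */
uint64_t bytes_to_pages(uint64_t bytes)
{
    const uint64_t page_size = UINT64_C(1) << ASSUMED_PAGE_SHIFT;
    return (bytes + page_size - 1) / page_size;
}
```

Under that assumption, vm_pgoff 0x70500 maps to byte offset 0x70500000, which matches the handle reported by P2012_BBQ_INIT, and the 0x80000 byte region is 128 pages (512 KiB) long.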

I suspect that the BBQtoSThorm queue is not read on the OCL Run-Time side, so the queue fills up and ENOSPC is returned.
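To make the suspected failure mode concrete, here is a minimal sketch in C of a fixed-capacity message FIFO whose send fails with -ENOSPC once nobody drains it. Everything here is hypothetical: the names are not taken from p2012_ld.c, and the capacity of 3 is chosen only to mirror the "fourth run fails" symptom, not because the real queue depth is known.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver-side message FIFO. The capacity is
 * an assumption picked to reproduce the symptom seen on the board. */
#define FIFO_SLOTS 3

struct msg_fifo {
    int slots[FIFO_SLOTS];
    size_t head;   /* index of the oldest queued message */
    size_t count;  /* number of messages currently queued */
};

/* Returns 0 on success, -ENOSPC when every slot is still occupied. */
int fifo_send(struct msg_fifo *q, int msg)
{
    if (q->count == FIFO_SLOTS)
        return -ENOSPC;
    q->slots[(q->head + q->count) % FIFO_SLOTS] = msg;
    q->count++;
    return 0;
}

/* Returns 0 and stores the oldest message, or -EAGAIN when empty. */
int fifo_receive(struct msg_fifo *q, int *msg)
{
    if (q->count == 0)
        return -EAGAIN;
    *msg = q->slots[q->head];
    q->head = (q->head + 1) % FIFO_SLOTS;
    q->count--;
    return 0;
}
```

If the receiving side never calls the equivalent of `fifo_receive`, the fourth `fifo_send` fails regardless of what the sender does; a single receive frees a slot and the next send succeeds again. Restarting the sender would only help if that also recreates the queue, which would be consistent with what we see after restarting BBQ.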

This suspicion is also supported by the fact that we observed no variation in performance when running the MultiView application with a 25% or a 100% time quota.

Could it be that the run-time on the last SDK you sent does not integrate the time quota management features?

What is missing is a mechanism to skip the messages in case they are not needed.

On that point I still have to better understand the way you manage the queues.

I'm having a look at the device driver code...

I will add that soon. For now can you deactivate the posting of those messages ?

You mean to not post messages from BBQ to SThorm?!?

We use such posting to notify the device-side OCL Run-Time that a constraint set has been updated... don't you need that information?

Is your current implementation of the time-sharing control just "polling" each constraint set from time to time?

Ok, I've to look at the p12runtime code ASAP...

Ciao,

Germain

Ciao Patrick

By the way, is there a version of FaceDetect working with the new SDK?

... I would like to integrate it with the BBQ-Android RTLib, perhaps it could be an interesting demo of BBQ-SThorm for DATE.

Germain HAUGOU

Mar 13, 2013, 9:51:33 AM

to Patrick Bellasi, bosp-...@googlegroups.com

Hi Patrick,

by migration, I mean that once a command has started execution on the fabric, it no longer takes the quota into account, because we don’t want to preempt things. The quotas are taken into account only when there is a scheduling choice to be made. This is why notifications from the host are not used by OpenCL, and thus the message FIFOs get full.

The notifications may be used by NPM. In case only OpenCL is executed, I will make sure the messages are taken from the FIFO.

Face detection sources are not in the SDK anymore because of some issues with critical code that we integrated from another division. However, you can still find the binaries inside the SDK. Do you need the sources, or are the binaries alone sufficient?

As for the Android demonstration, it has been removed because it was no longer compatible with recent modifications made to the Face detection kernels, and we did not have time to port it.

Ciao,

Germain

Patrick Bellasi

Mar 13, 2013, 10:53:51 AM

to Germain HAUGOU, bosp-...@googlegroups.com

On Wed, Mar 13, 2013 at 2:51 PM, Germain HAUGOU <germain...@st.com> wrote:

Hi Patrick,

Hi Germain!

by migration, I mean that once a command has started execution on the fabric, it no longer takes the quota into account, because we don’t want to preempt things.

Ok, I agree and know the reasons behind that point...

The quotas are taken into account only when there is a scheduling choice to be made. This is why notifications from the host are not used by OpenCL, and thus the message FIFOs get full.

Ok, yes, I've got that point as well. Indeed, right now I've disabled notifications to and from the BBQ PIL. With that configuration, apart from DMA failures, BBQ seems to be able to run the same application multiple times.

However, I still have a question. If we run a single instance of an application, are constraints considered or not? We observe the same performance when running the same application in two different AWMs, e.g. with 25% or 100% of fabric time... but this is not what we would expect.

I suspect that the OCL run-time considers BBQ-defined constraints ONLY if more than one application is running concurrently. Could you confirm that point?

You could ask: why would we expect constraints to be respected even when a single application is running? The point was to make it possible to have a power/performance trade-off policy running on the host side.

Such policies are possible only if we are able to prevent even a single OCL application from using more than a certain amount of fabric time.

A possible "workaround", to get the same result without modifying the current OCL run-time implementation, is to let BBQ schedule a sort of OCL "idle" task. This task would be assigned the fabric time quota we want to "sleep". Question: is a sleep call available for kernels running on xp70? Such a call just requires setting up a timer interrupt and then entering a WFI state... but I don't know if a programmable timer that keeps running while the xp70 is in WFI is available.
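To illustrate what I have in mind with the "idle" task, here is a small simulation in C. It is only a sketch under strong assumptions: it is not the OCL run-time scheduler, it assumes all commands have the same length (one slot), and it assumes the scheduler picks, at each command boundary, the task that is furthest behind its quota. The function and struct names are hypothetical.

```c
#include <assert.h>

struct task {
    unsigned quota_pct;   /* share of fabric time assigned by the host */
    unsigned long used;   /* slots consumed so far */
};

/* Returns how many of `slots` equal-length command slots go to the
 * application when a hypothetical idle task holds the remaining quota.
 * Decisions are taken only between commands, never by preemption. */
unsigned long simulate_app_share(unsigned app_quota_pct, unsigned long slots)
{
    assert(app_quota_pct >= 1 && app_quota_pct <= 100);

    struct task app  = { app_quota_pct, 0 };
    struct task idle = { 100u - app_quota_pct, 0 };

    for (unsigned long i = 0; i < slots; i++) {
        /* The idle task runs only while the application is ahead in
         * quota-normalised time, i.e. app.used / app.quota_pct is greater
         * than idle.used / idle.quota_pct (cross-multiplied to stay in
         * integer arithmetic). */
        if (app.used * idle.quota_pct > idle.used * app.quota_pct)
            idle.used++;
        else
            app.used++;
    }
    return app.used;
}
```

With a 25% quota the application gets one slot out of every four, and with a 100% quota the idle task never runs, which is the difference in fabric time we would expect to observe between the two AWMs.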

The notifications may be used by NPM. In case only OpenCL is executed, I will make sure the messages are taken from the FIFO.

Ok, this has to be better discussed... probably we have to identify a "final" design regarding the SThorm-BBQ communication needs...

Face detection sources are not in the SDK anymore because of some issues with critical code that we integrated from another division.

Ok... unfortunately...

However, you can still find the binaries inside the SDK. Do you need the sources, or are the binaries alone sufficient?

For the integration with BBQ-RTLib, yes, I need sources... at least for the Java code.

If the actual image processing algo is provided in a native shared library, then I can simply link the .so...

As for the Android demonstration, it has been removed because it was no longer compatible with recent modifications made to the Face detection kernels, and we did not have time to port it.

Ok, so I think for the time being we have to give up on the idea of integrating FaceDetect, and see if a different Android application could be integrated... maybe MultiView... but I doubt it will be ready for DATE. :-/

Ciao,

Germain

Ciao Patrick

Germain HAUGOU

Mar 13, 2013, 2:28:04 PM

to Patrick Bellasi, bosp-...@googlegroups.com

Hi Patrick,

the quota should work even when there is a single application running. I remember I tested this scenario on Gepop some time ago and, after a few fixes in Gepop (idle waiting on timer interrupt was not working), it was working well. I think there is something wrong with the timer on the board. I will have access to the board only on Friday; I’m going to check that then.

Ciao,

Germain

Patrick Bellasi

Mar 14, 2013, 5:21:04 AM

to Germain HAUGOU, bosp-...@googlegroups.com

On Wed, Mar 13, 2013 at 7:28 PM, Germain HAUGOU <germain...@st.com> wrote:

Hi Patrick,

Hi Germain!

the quota should work even when there is a single application running. I remember I tested this scenario on Gepop some time ago and, after a few fixes in Gepop (idle waiting on timer interrupt was not working), it was working well.

Good, that's good news! Indeed, that's the proper and expected behavior for the run-time control to be effective.

I'll run a test using the Gepop sim, just to check that it is still working on that platform.

I think there is something wrong with the timer on the board. I will have access to the board only on Friday, I’m going to check that.

Ok, let me know if I could support this investigation somehow.

Ciao,
Germain

Thanks and Ciao

Patrick
