Could this flow happen in SMP mode?


tugouxp tugouxp

Jan 13, 2020, 6:09:18 AM
to NuttX
      CPU0 (task0)                            CPU1 (task1, idle)                     g_cpu_irqset

1.    ...                                     enter_critical_section                 g_cpu_irqset = 2

2.    ...                                     yield to idle (so the spinlock is      g_cpu_irqset = 0
                                              released; task1's irqcount == 1)

3.    enter_critical_section                  irqcount = 1                           g_cpu_irqset = 1
      (because the spinlock was freed
       by step 2)

4.    ...                                     irqcount = 1                           g_cpu_irqset = 1

5.    sched_addreadytorun(task1)              irqcount = 1                           g_cpu_irqset = 3?
So, at this point, both tasks task0 and task1 are inside the critical section?


Step 1: CPU0 runs task0 and CPU1 runs task1 (and idle). task0 and task1 have affinity 0x03, which means they can run on either core.
        CPU1, in task1's context, first enters the critical section with an "enter_critical_section" call, and task1's irqcount is incremented to 1.

Step 2: For some reason, a context switch happens on CPU1 from task1 to idle. Because the idle task does not hold the lock, the spinlock and g_cpu_irqset
        are released by "sched_removereadytorun":

        int sched_removereadytorun(...)
        {
            /* spin_clrbit() will be done in sched_resumescheduler() */
        }

        int sched_resumescheduler(...)
        {
            spin_clrbit(...);   /* <--- here the lock is released, but with task1's irqcount still 1 */
        }

Step 3: Because the lock was released on CPU1, the next "enter_critical_section" on CPU0 succeeds.

        At this point, the atomicity of the critical section has already been broken.
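(To make it clearer why step 3 succeeds, here is a much-simplified paraphrase of the SMP path of enter_critical_section() in irq_csection.c. This is only a sketch of the bookkeeping that matters for this scenario; the real function also handles interrupt context, CPU pause requests and deadlock avoidance, and uses its own wait loop rather than a plain spin_lock(), so do not read it as the actual implementation.)

    irqstate_t enter_critical_section(void)
    {
      irqstate_t ret = up_irq_save();        /* Disable interrupts on this CPU */
      FAR struct tcb_s *rtcb = this_task();
      int cpu = this_cpu();

      if (rtcb->irqcount > 0)
        {
          /* This task already holds the critical section: just nest */

          rtcb->irqcount++;
        }
      else
        {
          /* Wait until no CPU holds g_cpu_irqlock.  In step 2 above,
           * sched_resumescheduler() on CPU1 cleared CPU1's bit and released
           * this lock, so this wait on CPU0 succeeds in step 3 even though
           * task1's irqcount is still 1.
           */

          spin_lock(&g_cpu_irqlock);

          /* Mark this CPU as a holder of the critical section */

          rtcb->irqcount = 1;
          spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock, &g_cpu_irqlock);
        }

      return ret;
    }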

Step 4: On CPU1, task1 still has irqcount == 1, and g_cpu_irqset == 1.

Step 5: task1 is made ready again by CPU0, so sched_addreadytorun is called:

       int sched_addreadytorun(...)
       {
         ...
         if (btcb->irqcount > 0)
           {
             /* Yes... make sure that scheduling logic on other CPUs knows
              * that we hold the IRQ lock.
              */

             spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
                         &g_cpu_irqlock);
           }
         ...
       }

So, at this point, CPU0 helps to "recover" task1's lock status by re-taking the spinlock and setting g_cpu_irqset to 3, while task0 still holds the critical section!


Could this sequence happen?


Thanks for your kind help.
We are doing a project in SMP mode that is very important to our customers, so we have to understand the stability of the SMP logic to be responsible to our client.

thanks again!


 


tugouxp tugouxp

Jan 14, 2020, 6:33:01 AM
to NuttX

This sequence definitely happens, and I am trying to fix it, but before that I need to get more knowledge of the SMP implementation.
See below:
Two tasks are stopped at the same while(abc), which is protected by a critical section.

tugouxp tugouxp

Jan 14, 2020, 6:35:54 AM
to NuttX

(attachment: 7539410320933535.png)

Gregory Nutt

Jan 14, 2020, 10:23:02 AM
to nu...@googlegroups.com
Yes, that behavior happens normally in the "healthy" case.  But it can also happen in an "unhealthy" case.

See https://groups.google.com/forum/#!topic/nuttx/2dpzttQbVlk my emails dated January 6 and 13.


tugouxp tugouxp

Jan 14, 2020, 8:30:28 PM
to NuttX
What do you mean by the terms "healthy" and "unhealthy" case? Can you explain more clearly?
Thank you!

Gregory Nutt

Jan 14, 2020, 9:13:50 PM
to nu...@googlegroups.com

> What do you mean by the terms "healthy" and "unhealthy" case? Can you explain more clearly?
> Thank you!

The healthy and normal case is:

  1. CPU 0 takes critical section so that it can change the task lists. (g_cpu_irqset = 0x01, irqcount == 1)
  2. CPU 0 starts a task on CPU 1.  The task on CPU 1 needs the critical section (g_cpu_irqset = 0x03.  This is completely normal).
  3. CPU 0 releases the critical section (g_cpu_irqset = 0x02, irqcount == 0);

Actually, there may be nesting so that irqcount is greater than 1 at step 1.  That is all still healthy if CPU 0 unwinds and releases the full irqcount so that g_cpu_irqset = 0x02, irqcount == 0 when it is finished modifying the task list.

In the unhealthy case, irqcount would be > 1 at step 1 and after CPU 0 releases the irqcount, it is still greater than zero and g_cpu_irqset is still 0x03.
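A schematic way to read the two cases (illustrative only, using the enter/leave calls named above; this is not code from the scheduler):

    /* Healthy: CPU 0 entered the critical section only to manipulate the
     * task lists, so every enter is matched by a leave before control
     * returns to application code, and only CPU 1's bit remains set.
     */

    irqstate_t outer = enter_critical_section(); /* irqcount 0 -> 1, g_cpu_irqset |= 0x01 */
    irqstate_t inner = enter_critical_section(); /* irqcount 1 -> 2 (nested)              */

    /* ... start a lock-holding task on CPU 1: g_cpu_irqset becomes 0x03 ... */

    leave_critical_section(inner);               /* irqcount 2 -> 1                       */
    leave_critical_section(outer);               /* irqcount 1 -> 0, g_cpu_irqset = 0x02  */

    /* Unhealthy: some outer application or driver code on CPU 0 is still
     * inside its own enter_critical_section() when the scheduler does the
     * above, so the unwind stops with irqcount > 0 and g_cpu_irqset stays
     * at 0x03: two CPUs are in the critical section at once.
     */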

I sent you a test case that should illustrate the unhealthy case (if it exists) in my first email on January 6 here:  https://groups.google.com/forum/#!topic/nuttx/2dpzttQbVlk .  The unhealthy case has not been proven to exist, but if it does, it is only the unhealthy case that needs to be fixed.

The problem, as you see, is that there can always be multiple CPUs holding the critical section at the time of the context switch.  So it is impossible to distinguish the healthy case from the unhealthy case.

That is why I suggested that we need two different critical section locks:  One general one and one to control access to the task lists, only.  There are, in fact, already two such locks.  But I don't think they operate independently enough to be useful for this purpose right now.

But I think that the first step to changing the unhealthy behavior would be to change the task list locking so that we can at least distinguish the healthy case from the unhealthy case.  If we can do that, then the rest of the problem is easy to solve.

Greg



tugouxp tugouxp

Jan 15, 2020, 4:15:22 AM
to NuttX
Thanks for your reply.

I am a little puzzled about the healthy case:
   2. CPU 0 starts a task on CPU 1.  The task on CPU 1 needs the critical section (g_cpu_irqset = 0x03.  This is completely normal).
    It seems this can't happen, because the global lock is owned by CPU 0 and would only be released at step 3; it is busy now because CPU 0 has taken it,
    so CPU 1 would spin until step 3.

The race does not happen very often if code follows the normal conventions rather than deliberately violating them,

but it is still a fatal race condition, because it is bound to happen in complex applications.

Gregory Nutt

Jan 15, 2020, 8:15:59 AM
to nu...@googlegroups.com

> I am a little puzzled about the healthy case:
>    2. CPU 0 starts a task on CPU 1.  The task on CPU 1 needs the critical section (g_cpu_irqset = 0x03.  This is completely normal).
>     It seems this can't happen, because the global lock is owned by CPU 0 and would only be released at step 3; it is busy now because CPU 0 has taken it,
>     so CPU 1 would spin until step 3.
Not in this case.  There is no spinlock wait involved when a task that holds the critical section is restarted.  If the spinlock is already locked, the scheduler just sets the bit in g_cpu_irqset.

patacongo

Jan 15, 2020, 9:45:10 AM
to NuttX
I think I understand your confusion.  Let me add a couple more things (in blue).

The healthy and normal case is:

  • CPU 0 takes the critical section so that it can change the task lists. (g_cpu_irqset = 0x01, irqcount == 1)
  • CPU 0 stops CPU 1 via an inter-processor interrupt.  CPU 1 waits on a spinlock inside the inter-processor interrupt handler.
  • CPU 0 starts a task on CPU 1.
  • CPU 0 starts the task on CPU 1 by setting up data structures so that a context switch will occur when CPU 1 returns from the interrupt handler.
  • The task on CPU 1 needs the critical section (g_cpu_irqset = 0x03.  This is completely normal).
  • Remember that CPU 1 is not running.  It cannot wait for anything.  CPU 0 has complete control of the task lists and is managing this condition.
  • After setting up the context switch, CPU 0 resumes CPU 1.
  • The inter-processor interrupt on CPU 1 returns, thus performing the context switch to the new task that is now running on CPU 1.
  • CPU 0 releases the critical section (g_cpu_irqset = 0x02, irqcount == 0).
That is all good and normal.  It is the unhealthy case that is not controlled.
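A rough schematic of that sequence (not the actual nxsched code path; up_cpu_pause()/up_cpu_resume() are the NuttX inter-processor pause primitives):

    irqstate_t flags = enter_critical_section(); /* CPU 0: g_cpu_irqset |= 0x01 */

    up_cpu_pause(1);               /* CPU 1 now spins inside its inter-processor
                                    * interrupt handler and cannot run anything */

    /* CPU 0 safely rewrites CPU 1's task lists and saved context here.  If
     * the incoming task holds the critical section (btcb->irqcount > 0),
     * CPU 0 also sets bit 1 in g_cpu_irqset on its behalf, so g_cpu_irqset
     * becomes 0x03.
     */

    up_cpu_resume(1);              /* CPU 1 returns from the interrupt and
                                    * context-switches to the new task */

    leave_critical_section(flags); /* CPU 0: g_cpu_irqset = 0x02 */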

Greg

tugouxp tugouxp

Jan 15, 2020, 9:10:53 PM
to nu...@googlegroups.com
And what about my case above, is it healthy or unhealthy?

Both CPUs enter the same critical region.  If the while(abc) were replaced by a protected "count++",
it is obvious that each CPU would have full access to count, so a race condition must happen.

Gregory Nutt

Jan 15, 2020, 9:22:42 PM
to nu...@googlegroups.com
> And what about my case above, is it healthy or unhealthy?

Your example is "unhealthy".  This is not the "healthy" case where the
lock is held only to access task list structures.

But the first step in addressing this would be to have some way of
distinguishing the healthy vs. unhealthy case.  They cannot be
distinguished now because the same lock is used for general OS logic as
is used for locking access to the task lists.

If the two cases could be distinguished, then I have a good design that
would eliminate the issue.  If they cannot be distinguished, then I
don't think there is any solution.

Greg


tugouxp tugouxp

Jan 15, 2020, 9:47:10 PM
to nu...@googlegroups.com
It seems the unhealthy case is definitely a bug.
Thank you, I will learn from this.


Masayuki Ishikawa

Jan 15, 2020, 9:56:49 PM
to nu...@googlegroups.com
Hi,

>It seems the unhealthy case is definitely a bug.
>Thank you, I will learn from this.

Could you provide the test code and conditions so that we can reproduce and analyze the bug?

Thanks,
Masayuki



Gregory Nutt

Jan 15, 2020, 9:58:36 PM
to nu...@googlegroups.com

> Hi,
>
> >It seems the unhealthy case is definitely a bug.
> >Thank you, I will learn from this.
>
> Could you provide the test code and conditions so that we can
> reproduce and analyze the bug?
>
> Thanks,
> Masayuki
It is in a .png image in one of the emails in this thread.

Gregory Nutt

Jan 15, 2020, 10:06:54 PM
to nu...@googlegroups.com

> >It seems the unhealthy case is definitely a bug.
> >Thank you, I will learn from this.
>
> Could you provide the test code and conditions so that we can
> reproduce and analyze the bug?

It would be interesting if only Cortex-A had this problem.  But I don't
think that will be the case.  I think the problem is inherent in the design.

But it would be great if I were wrong.

Greg

Masayuki Ishikawa

Jan 15, 2020, 10:11:19 PM
to nu...@googlegroups.com
> It is in a .png image in one of the emails in this thread.

Thanks, but I'd like to know how each function (i.e. main_test1, main_test0) is called.
It seems that he added the code into nx_bringup.c

I want to see the whole code. 


Masayuki Ishikawa

Jan 15, 2020, 10:27:41 PM
to nu...@googlegroups.com
>> It is in a .png image in one of the emails in this thread.
>
>Thanks, but I'd like to know how each function (i.e. main_test1, main_test0) is called.
>It seems that he added the code into nx_bringup.c
>
>I want to see the whole code. 

We prefer test code which can run on nsh.

Thanks,
Masayuki


Gregory Nutt

Jan 15, 2020, 10:32:47 PM
to nu...@googlegroups.com

> We prefer test code which can run on nsh.

I haven't looked at this carefully, but I think this could be implemented as two builtin tasks:

  • The first runs and blocks on nxsem_wait()
  • The second calls nxsem_post() to demonstrate the problem.

This, of course, would have to violate the POSIX interface; applications don't get to have critical sections or to call nxsem_* interfaces.  But you can violate those interfaces for the purposes of testing.
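A very rough sketch of what such a test might look like.  The names (csection_task0/1, g_sem, g_in_csection) are invented here; this is not the test from the .png, and it deliberately abuses the kernel-only enter_critical_section()/nxsem_*() interfaces as described above.  g_sem would be initialized elsewhere with nxsem_init(&g_sem, 0, 0), and the two tasks would be started with affinity to CPU 0 and CPU 1 respectively.

    static sem_t g_sem;
    static volatile int g_in_csection;   /* How many tasks are "inside" right now */

    static int csection_task0(int argc, FAR char *argv[])   /* Bound to CPU 0 */
    {
      irqstate_t flags = enter_critical_section();

      g_in_csection++;
      nxsem_wait(&g_sem);                /* Block while still holding the lock */
      g_in_csection--;

      leave_critical_section(flags);
      return 0;
    }

    static int csection_task1(int argc, FAR char *argv[])   /* Bound to CPU 1 */
    {
      irqstate_t flags = enter_critical_section();

      if (g_in_csection > 0)
        {
          /* task0 is blocked "inside" the critical section while this task
           * is also inside it, so the section is not exclusive.
           */

          _alert("Critical section violated: both tasks are inside\n");
        }

      nxsem_post(&g_sem);                /* Unblock task0 */
      leave_critical_section(flags);
      return 0;
    }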

Greg


tugouxp tugouxp

Jan 16, 2020, 12:49:43 AM
to nu...@googlegroups.com
(attachment: nx_bringup.c)

Masayuki Ishikawa

Jan 16, 2020, 1:52:10 AM
to nu...@googlegroups.com
I'm a bit confused with the code you added in nx_bringup.c.
You also added while (abc); in both threads (perhaps) to debug kernel resources?

And what is your intention for this test?
I think this test is to check whether two tasks are synchronized with one semaphore.

I've just added this test case (slightly modified not to use while (1) for example) as csection_test.c under ostest
and the test seems to work without any problems so far on Sony Spresense in dual-core mode.

Masayuki


Masayuki Ishikawa

Jan 16, 2020, 2:20:44 AM
to nu...@googlegroups.com
By the way, did you try ostest on your environment (dual Cortex-A7)?
There are many tests including semaphore.

Masayuki


Gregory Nutt

Jan 16, 2020, 8:55:33 AM
to nu...@googlegroups.com

> I've just added this test case (slightly modified not to use while (1)
> for example) as csection_test.c under ostest
> and the test seems to work without any problems so far on Sony Spresense
> in dual core mode.

Hmm.. So I think we need to run the identical test on both platforms. 
We need to determine if the unhealthy behavior is unique to the Cortex-A
and we can only do that by running the identical test on both platforms.

I would offer to help, but I am tied up with other things right now.

tugouxp tugouxp

Jan 16, 2020, 8:48:59 PM
to NuttX
I tried ostest first on the Cortex-A7 platform, but it easily crashes and asserts in SMP mode, while it runs very well in single-CPU mode.  I have raised an issue about this on the forum.

The while(abc) is meant to make the tasks get stuck at the race point; main_task1 and main_task0 are bound to cpu1 and cpu0 respectively.
So, if the two threads get stuck there at the same time, it means the critical section has failed.





tugouxp tugouxp

Jan 16, 2020, 8:50:51 PM
to NuttX
And, if the while(abc) were replaced by "count++", it would be easy to see the race condition.



spudaneco

Jan 16, 2020, 9:08:09 PM
to nu...@googlegroups.com






> I tried ostest first on the Cortex-A7 platform, but it easily crashes and asserts in SMP mode, while it runs very well in single-CPU mode.  I have raised an issue about this on the forum.

The Cortex-A SMP is not tested often, so it is very common to see the ostest broken after a long time.  It usually has to be debugged to bring it back.  So that is common.  The ostest worked well the last time I used it a couple of years ago.

I don't believe that has anything to do with the other test case you have been talking about.

tugouxp tugouxp

Jan 16, 2020, 9:09:47 PM
to NuttX
thanks, but "If the spinlock is already locked, it just sets the bit in g_cpu_irqset."  
 i think this is the root behavior that difference from others rtos implementations.  the restarted task set the irqlock status g_cpu_irqset and take-up the lock directly without do any judgement of the lock status throuth "spin_setbit", 
but if the cpu0 owned the spinlock present, so after the "spin_setbit" was call, who is responsible for own the "global lock"

in my thought, the critical section should follow the semantics of it`s design, only one execution flow in the critical sections at the same time during the whole system, no matter user app, driver or even kernel bundries.

the restarted task should do the re-take opertions if its "irqcount" is not zero, should not change the status to clarify its own status, because the lock may be busy by cpu0 now.
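A literal reading of that suggestion would look something like the sketch below (hypothetical, not a proposed patch).  Note that, per Greg's earlier explanation, this cannot work as-is: sched_addreadytorun() runs while the scheduling CPU itself already holds the critical section, so spinning here would deadlock; presumably that is exactly why the code only sets the bit.

    /* Hypothetical variant of the sched_addreadytorun() snippet, NOT a fix */

    if (btcb->irqcount > 0)
      {
        /* "Re-take" the global lock on behalf of the restarted task... */

        spin_lock(&g_cpu_irqlock);   /* ...but this deadlocks if the calling
                                      * CPU already holds g_cpu_irqlock */
        spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock, &g_cpu_irqlock);
      }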

Masayuki Ishikawa

Jan 16, 2020, 9:11:43 PM
to nu...@googlegroups.com
OK, I understand.

Your test case is very basic, so without passing this scenario, ostest would never pass.

Currently I can run ostest on LC823450 (dual Cortex-M3), Spresense (6 cores Cortex-M4F but
only 2 cores are enabled to test) and K210 (dual RV64GC with I/D caches). 

So I think something is wrong with (perhaps) the Cortex-A SMP implementation, but unfortunately
I have no Cortex-A MPCore board to check.

By the way, you are using an SMP-capable debugging tool; could you tell me the tool name?

Thanks,

Masayuki



Xiang Xiao

Jan 16, 2020, 9:27:11 PM
to NuttX
Hi,
Any sleep/wait inside a critical section automatically loses the exclusion, so you cannot sleep before you finish using the shared resource if you want your demo to work.  Your demo also breaks in the single-CPU case.
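In other words, the pattern being warned about looks like this (schematic; g_some_sem and the shared_state_*() calls are placeholders for whatever uses the shared resource):

    irqstate_t flags = enter_critical_section();

    shared_state_begin();       /* Start using the shared resource */
    nxsem_wait(&g_some_sem);    /* BAD: blocking here yields the CPU, so other
                                 * tasks run and the "exclusive" section no
                                 * longer excludes them, on one CPU or many */
    shared_state_end();

    leave_critical_section(flags);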

Thanks
Xiang


tugouxp tugouxp

Jan 16, 2020, 9:28:04 PM
to NuttX
I am using the DSTREAM, usually called DS-5, the Arm development tools.

But I also use OpenOCD + J-Link tools sometimes; the difference is the IDE graphical view, which I think does not matter much.

But DS-5 is more stable than OpenOCD, I think.

Masayuki Ishikawa

Jan 16, 2020, 10:32:36 PM
to nu...@googlegroups.com
Thanks for the tool information.


tugouxp tugouxp

Jan 16, 2020, 11:30:32 PM
to NuttX
Thanks for your advice.

    But I think this is related to a design policy.  In Linux, although it is not recommended, if you yield inside a critical section (spinlocks, preempt-disable, usually), it works, and the lock status is recovered when the task switches back in (it is explicitly re-taken),
so all of the zone protected by the critical section remains safe.

    In non-SMP mode this routine works perfectly, because each task owns a CPSR state that is restored when it is switched back in, so two flows can never be inside the same "critical region" at once.

My confusion is rooted in the key point in "sched_addreadytorun": why is the "global spin lock" simply marked busy according to the next ready TCB's "irqcount", without considering the current status of the global spin lock, and without re-taking the resource?
This will be visible to user-level applications.  Unless hierarchical locks are used for each individual resource, or the "critical_section" operations are limited to kernel level, there will be occasions that violate the atomic semantics.  That is my thought.

Thanks for your advice.

    if (btcb->irqcount > 0)
      {
        /* Yes... make sure that scheduling logic on other CPUs knows
         * that we hold the IRQ lock.
         */

        _alert("irqset cpu %d, me %d btcbname %s, irqset %d irqcount %d. pc 0x%08x.\n",
               cpu, me, btcb->name, g_cpu_irqset, btcb->irqcount, btcb->xcp.regs[15]);
        spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
                    &g_cpu_irqlock);
        _alert("%s line %d g_cpu_irqset = %d.\n", __func__, __LINE__, g_cpu_irqset);
      }



patacongo

Jan 18, 2020, 9:00:00 AM
to NuttX
As I mentioned before, I have a design to solve the issue.  I think it is a bad idea to start a task on a CPU only so that the task can wait on a spinlock.  A better design would just not let any task run on a CPU B that needs a critical section while another CPU A holds the lock.

A better design would be to change the scheduler so that no task that requires the critical section can run until the critical section is released.  I have posted the attached state transition diagram before (csection.pdf).  I have fleshed out the state transition events/actions a little more too (also attached as state=labels.txt).

That design depends upon separating the "healthy" from the "unhealthy" cases as we discussed in a different email.
(attachments: csection.pdf, state=labels.txt)

tugouxp tugouxp

Jan 19, 2020, 2:31:39 AM
to NuttX
Thanks!
I need time to read the document in detail to understand the spirit of the new design.  Thanks again for your dedication.

NuttX has now grown to a large scale, so any functional modification, especially for SMP-related issues, is not easy to deal with.

patacongo

Jan 19, 2020, 10:14:11 AM
to NuttX


> I need time to read the document in detail to understand the spirit of the new design.  Thanks again for your dedication.

The design, however, would not work with the current critical section logic.  The code currently takes the critical section in order to perform a context switch; that would not work with the new design.

That is why I said before that the first step would be to use a different mechanism for protecting the task lists during a context switch.  That other mechanism already exists: see spin_lock_irqsave() and spin_unlock_irqrestore().  That is the mechanism that already exists for locking the task lists.  But it is not used consistently and has some limitations.
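For reference, a minimal sketch of how that other mechanism is used (the exact signatures have changed between NuttX versions, so the argument-less form shown here is an assumption, and the real call sites in the scheduler are more involved):

    /* Lock the task lists and disable local interrupts */

    irqstate_t flags = spin_lock_irqsave();

    /* ... manipulate g_readytorun / g_assignedtasks ... */

    spin_unlock_irqrestore(flags);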

