Is it a fatal error if "g_cpu_irqset" is 3 on two Cortex-A7 cores in SMP mode?


tugouxp tugouxp

Dec 29, 2019, 9:39:22 PM
to NuttX
Hi folks:

The log below is NuttX running in SMP mode on a two-core Cortex-A7. You can see that every time sched_addreadytorun() is called, g_cpu_irqset is 3.

But from the design principle of this object, it seems that having two bits set to one should not happen!

Thank you!

 sched_addreadytorun: irqset cpu 1, me 0 btcbname init, irqset 1 irqcount 2.
 .sched_addreadytorun: sched_addreadytorun line 338 g_cpu_irqset = 3.
 enter_critical_section: count,count 1, rtn = 0x4002a950.
 enter_critical_section: count,count 1, rtn = 0x4002a4ec.
 enter_critical_section: count,count 1, rtn = 0x40035768.
 enter_critical_section: count,count 1, rtn = 0x40055194.
 enter_critical_section: count,count 1, rtn = 0x40035768.
 enter_critical_section: count,count 2, rtn = 0x400563d0.
 enter_critical_section: count,count 1, rtn = 0x4002a950.
 enter_critical_section: count,count 1, rtn = 0x400313e4.

Gregory Nutt

Dec 29, 2019, 9:51:25 PM
to nu...@googlegroups.com
> The log below is NuttX running in SMP mode on a two-core Cortex-A7. You can
> see that every time sched_addreadytorun() is called, g_cpu_irqset is 3.
>
> But from the design principle of this object, it seems that having two bits
> set to one should not happen!

It can happen, but only under one very specific condition. g_cpu_irqset exists only to support this condition:

1. A task running on CPU 0 takes the critical section.  So g_cpu_irqset == 0x1.

2. A task exits on CPU 1 and a waiting, ready-to-run task is re-started on CPU 1.  This new task also holds the critical section.  So when the task is re-started on CPU 1, we then have g_cpu_irqset == 0x3.

So we are in a very perverse state!  There are two tasks running on two different CPUs and both hold the critical section.  I believe that is a dangerous situation and there could be undiscovered bugs that appear in that case.  However, as of this moment, I have not heard of any specific problems caused by this weird behavior.

From what you are saying, it is causing an assertion to fire.  I have not seen that before either.

Greg


tugouxp tugouxp

Dec 29, 2019, 10:19:32 PM
to nu...@googlegroups.com
Does this mean the global critical section fails to protect resources shared by multiple cores?
Although a fatal error has not been seen so far for whatever reason, according to Murphy's law, an error must eventually happen if it is possible for it to happen.
Maybe this case has just not been fully tested. Can I understand it that way?

--
You received this message because you are subscribed to the Google Groups "NuttX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nuttx+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nuttx/5fac7114-b7e9-e6a4-be55-b40ad9deda9e%40gmail.com.

tugouxp tugouxp

Dec 29, 2019, 10:33:41 PM
to NuttX

Gregory Nutt

Dec 29, 2019, 10:53:40 PM
to nu...@googlegroups.com

It is late here; I will have to look at the code tomorrow before I have anything to say.

Masayuki Ishikawa

Dec 29, 2019, 11:54:18 PM
to nu...@googlegroups.com
Dear Greg,

I remember that we introduced CONFIG_ARCH_GLOBAL_IRQDISABLE for processors that support a maskable IPI.
On both LC823450 and CXD56XX, this option is automatically selected in arm/Kconfig.
And I think this should be applied to Cortex-A MPCore once it supports the new IPI.

Also, I introduced CONFIG_SPINLOCK_IRQ to replace some critical section APIs with spin lock APIs to improve performance and I enabled this configuration on the above platforms as well. However, I think this configuration does not relate to this issue.

Thanks,
Masayuki

On Mon, Dec 30, 2019 at 12:53 Gregory Nutt <spud...@gmail.com> wrote:

Gregory Nutt

Dec 30, 2019, 1:17:05 PM
to nu...@googlegroups.com

> It is late here, I will have to look at the code tomorrow to have
> anything to say.
>
This is probably a real problem.  The restarted task should really
re-acquire the critical section before it continues.

Another option that we are considering (off-list) is to include
disabling of pre-emption as part of the critical section logic.  The
problem should disappear if pre-emption were also disabled.  And it
seems to me that this should be integrated into
enter_critical_section() so that it does both.

There are some complexities that make this not quite as straight-forward.


patacongo

Jan 1, 2020, 11:16:00 AM
to NuttX

> This is probably a real problem.  The restarted task should really
> re-acquire the critical section before it continues.
>
> Another option that we are considering (off-list) is to include
> disabling of pre-emption as part of the critical section logic.  The
> problem should disappear if pre-emption were also disabled.  And it
> seems to me that this should be integrated into
> enter_critical_section() so that it does both.
>
> There are some complexities that make this not quite as straight-forward.

A possible solution would be to add a new task state that would exist only for SMP.

  • Add a new SMP-only task list and state.  Say, g_csection_wait[].  It should be prioritized.
  • When a task acquires the critical section, all tasks in g_readytorun[] that need the critical section would be moved to g_csection_wait[].
  • When any task is unblocked for any reason and moved to the g_readytorun[] list, if that unblocked task needs the critical section, it would also be moved to the g_csection_wait[] list.  No task that needs the critical section can be in the ready-to-run list if the critical section is not available.
  • When the task releases the critical section, all tasks in g_csection_wait[] need to be moved back to g_readytorun[].
  • This may result in a context switch.  The tasks should be moved back to g_readytorun[] highest priority first.  If a context switch occurs and the critical section is re-taken by the re-started task, the lower-priority tasks in g_csection_wait[] must stay in that list.
That is really not as much work as it sounds.  It is something that could be done in 2-3 days of work if you know what you are doing.  Getting the proper test setup would be the more difficult task.

patacongo

Jan 1, 2020, 4:49:19 PM
to NuttX


A possible solution would be to add a new task state that would exist only for SMP. ...

NOTE that this issue was previously documented here:  https://github.com/apache/incubator-nuttx/blob/master/TODO#L586

In that analysis, it states that this problem should only be possible if CPU affinity is used.  Are you using CPU affinity?  Are you seeing the problem without CPU affinity?  That would change the documented conditions of how this could happen and, perhaps, the solution that I provided in the preceding email would be the correct one.

Let me know,
Greg

tugouxp tugouxp

Jan 3, 2020, 3:02:07 AM
to NuttX
I did not use CPU affinity in my case; I did not invoke any interface related to affinity.
All the tasks use the default affinity of 0x03, meaning they can run on either core.

Is this not what you expected?

Gregory Nutt

Jan 3, 2020, 9:17:55 AM
to nu...@googlegroups.com

> I did not use CPU affinity in my case; I did not invoke any
> interface related to affinity.
> All the tasks use the default affinity of 0x03, meaning they can
> run on either core.
Thanks.  I will update the issue in the TODO list with this information.
> Is this not what you expected?

That analysis in the TODO list is very old and, according to your
report, incorrect.

I am not working with SMP now and have not for a couple of years.  The
topic is complex and I am not sufficiently up to speed to be 100%
confident in a solution now (although I think the changes that I
outlined in a preceding email should be correct).

Greg


patacongo

Jan 4, 2020, 10:37:31 AM
to NuttX
I am implementing the fix on a fork.  This will probably take one, maybe two days.  I will try to set up my old i.MX6 to test.  I would appreciate it if you could help with the testing when the change is ready.

patacongo

Jan 4, 2020, 12:36:51 PM
to NuttX


I am implementing the fix on a fork.  This will probably take one, maybe two days.  I will try to setup my old i.MX6 to test.  I would appreciate if you could help with the testing when the change is ready.

@masayuki You seem to believe that these problems would go away if the support for the ICCMPR register were implemented.  Is that true?  We should talk more before I go implementing additional OS task state logic.

Gregory Nutt

Jan 4, 2020, 2:32:13 PM
to nu...@googlegroups.com, d...@nuttx.apache.org
[Including d...@nuttx.apache.org, I apologize in advance to those 100 or so who get duplicates].

Attached is a PDF describing the behavior that I am considering implementing in the OS.  I would appreciate it if anyone with familiarity with SMP in NuttX could comment.

If the attachment does not show up in the d...@nuttx.apache.org email, then please check this thread:  https://groups.google.com/forum/#!topic/nuttx/2dpzttQbVlk

Greg

On 1/4/2020 11:36 AM, patacongo wrote:


I am implementing the fix on a fork.  This will probably take one, maybe two days.  I will try to setup my old i.MX6 to test.  I would appreciate if you could help with the testing when the change is ready.

@masayuki You seem to believe that these problems would go away if the support for the ICCMPR register were implemented.  Is that true?  We should talk more before I go implementing additional OS task state logic.

csection.pdf

Masayuki Ishikawa

Jan 5, 2020, 1:04:34 AM
to NuttX
Greg,

I'm not exactly sure that this issue relates to the current IPI handling for Cortex-A MPCore, but I propose modifying the IPI handling for Cortex-A MPCore to be the same as Cortex-M, because the SMP-related logic will be simpler and more consistent.

Masayuki

Masayuki Ishikawa

Jan 5, 2020, 11:43:53 PM
to nu...@googlegroups.com
Greg,

Today I confirmed that I can reproduce this issue in SMP mode on both Spresense (six Cortex-M4F cores, but only two are enabled) and K210 (dual RV64GC). (NOTE: K210 SMP is still under testing, but I want to release the code soon.) So this is not a Cortex-A specific issue but a NuttX SMP issue.

What I did:

1. Add the following code at line 339 to check whether only one bit is set in g_cpu_irqset.

--- a/sched/sched/sched_addreadytorun.c
+++ b/sched/sched/sched_addreadytorun.c
@@ -335,6 +335,8 @@ bool sched_addreadytorun(FAR struct tcb_s *btcb)
 
               spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
                           &g_cpu_irqlock);
+
+              ASSERT((g_cpu_irqset & (g_cpu_irqset - 1)) == 0);
             }



2. Then I ran the hello app from nsh; it stopped with an assertion on CPU0.
3. So I attached openocd to CPU0, added a breakpoint at up_assert(), and ran the hello app again.

On CPU0, the hello app had just exited and tried to unblock a task (btcb shows "init") but asserted at line 339 in sched_addreadytorun(), which I had just added.

(gdb) where                                                                                                                          

#0  up_assert (filename=filename@entry=0xd01885a "sched/sched_addreadytorun.c", lineno=lineno@entry=339) at armv7-m/up_assert.c:432  

#1  0x0d00789a in sched_addreadytorun (btcb=btcb@entry=0xd026db0) at sched/sched_addreadytorun.c:339                                 

#2  0x0d006cb2 in up_unblock_task (tcb=0xd026db0) at armv7-m/up_unblocktask.c:88                                                     

#3  0x0d003a24 in nxsem_post (sem=sem@entry=0xd029114) at semaphore/sem_post.c:166                                                   

#4  0x0d004ae8 in nxtask_exitwakeup (status=218271616, tcb=0xd028f80) at task/task_exithook.c:547                                    

#5  nxtask_exithook (tcb=0xd028f80, status=status@entry=0, nonblocking=nonblocking@entry=0 '\000') at task/task_exithook.c:677       

#6  0x0d00415c in exit (status=0) at task/exit.c:95                                                                                  

#7  0x0d004132 in nxtask_start () at task/task_start.c:150                                                                           

#8  0x00000000 in ?? ()                                                                                                              

(gdb) up                                                                                                                             

#1  0x0d00789a in sched_addreadytorun (btcb=btcb@entry=0xd026db0) at sched/sched_addreadytorun.c:339                                 

339                   ASSERT((g_cpu_irqset & (g_cpu_irqset - 1)) == 0);                                                              

(gdb) p *btcb                                                                                                                        

$1 = {flink = 0xd022a60 <g_idletcb+380>, blink = 0x0, group = 0xd026f40, pid = 3, start = 0xd004109 <nxtask_start>, entry = {pthread = 0xd00774d <spresense_main>, main = 0xd00774d <spresense_main>}, sched_priority = 100 'd', init_priority = 100 'd', task_state = 4 '\004', cpu = 1 '\001', affinity = 3 '\003', flags = 0, lockcount = 2, irqcount = 1, timeslice = 20, waitdog = 0x0, adj_stack_size = 2028, stack_alloc_ptr = 0xd027220, adj_stack_ptr = 0xd027a08, waitsem = 0x0, sigprocmask = 0, sigwaitmask = 0, sigpendactionq = {head = 0x0, tail = 0x0}, sigpostedq = {head = 0x0, tail = 0x0}, sigunbinfo = {si_signo = 0 '\000', si_code = 0 '\000', si_errno = 0 '\000', si_value = {sival_int = 0, sival_ptr = 0x0}}, msgwaitq = 0x0, pthread_data = {0x0, 0x0, 0x0, 0x0}, pterrno = 0, xcp = {sigdeliver = 0x0, saved_pc = 0, saved_basepri = 0, saved_xpsr = 0, regs = {218265720, 128, 218272020, 218262960, 224, 25, 1, 0, 1, 0, 4294967273, 0 <repeats 16 times>, 2, 218263092, 218271748, 0, 110, 218118601, 218131740, 1090519040, 0 <repeats 17 times>, 218118601}}, name = "init", '\000' <repeats 27 times>}                                                                                            

(gdb)                                                                                                                                


On CPU1, it seems that CPU1 received a CPU pause (IPI) message from CPU0.
Actually this message was sent at line 279 in sched_addreadytorun.c

(gdb) where                                                                                                                          

#0  up_testset () at armv7-m/gnu/up_testset.S:101                                                                                    

#1  0x0d003a60 in spin_lock (lock=lock@entry=0xd022707 <g_cpu_wait+1> "\001") at semaphore/spinlock.c:89                             

#2  0x0d001368 in up_cpu_paused (cpu=1) at chip/cxd56_cpupause.c:232                                                                 

#3  0x0d002e2e in irq_dispatch (irq=irq@entry=112, context=context@entry=0xd026cbc) at irq/irq_dispatch.c:176                        

#4  0x0d001570 in up_doirq (irq=112, regs=0xd026cbc) at armv7-m/up_doirq.c:86                                                        

#5  0x0d0002f8 in exception_common () at armv7-m/gnu/up_exception.S:213                                                              

Backtrace stopped: previous frame identical to this frame (corrupt stack?)                                                           


To analyze the sequence, I saved the instrumentation buffer to a file.

  857: 0b 02 64 00 03 00 50 05 00 00 06       CPU0 PID  3: SUSPEND

  868: 0a 03 00 00 00 00 50 05 00 00          CPU0 PID  0: RESUME

  878: 0b 02 00 00 00 00 6b 05 00 00 03       CPU0 PID  0: SUSPEND

  889: 0a 03 64 00 03 00 6b 05 00 00          CPU0 PID  3: RESUME

  899: 0b 02 64 00 03 00 6b 05 00 00 06       CPU0 PID  3: SUSPEND

  910: 0a 03 00 00 00 00 6b 05 00 00          CPU0 PID  0: RESUME

  920: 0b 02 00 00 00 00 9d 05 00 00 03       CPU0 PID  0: SUSPEND

  931: 0a 03 64 00 03 00 9d 05 00 00          CPU0 PID  3: RESUME

  941: 10 00 64 00 05 00 9d 05 00 00 68 65 6c 6c 6f 00 CPU0 PID  5: START

  957: 0b 02 64 00 03 00 9d 05 00 00 06       CPU0 PID  3: SUSPEND

  968: 0a 03 64 00 05 00 9d 05 00 00          CPU0 PID  5: RESUME

  978: 0b 06 64 00 05 00 9d 05 00 00 01       CPU0 PID  5: CPU_PAUSE

  989: 0b 02 00 01 01 00 9d 05 00 00 04       CPU1 PID  1: SUSPEND

 1000: 0a 07 00 01 01 00 9d 05 00 00          CPU1 PID  1: CPU_PAUSED


Here, CPU1 is waiting for the g_cpu_wait spinlock, but that is OK because CPU0 has just asserted and CPU0 cannot call up_cpu_resume(cpu1).

However, I'm still not sure why CPU1 was in the critical section, because CPU1 was executing the idle task (PID 1) and waiting for an interrupt.

===
Masayuki

On Sun, Jan 5, 2020 at 2:36 patacongo <spud...@gmail.com> wrote:


I am implementing the fix on a fork.  This will probably take one, maybe two days.  I will try to setup my old i.MX6 to test.  I would appreciate if you could help with the testing when the change is ready.

@masayuki You seem to believe that these problems would go away if the support for the ICCMPR register were implemented.  Is that true?  We should talk more before I go implementing additional OS task state logic.

Masayuki Ishikawa

Jan 6, 2020, 12:18:25 AM
to nu...@googlegroups.com
Greg,

I think this is a global IRQ control logic bug?

Before calling spin_setbit(), g_cpu_irqset was 1 (i.e., CPU0 owned the irqset).
Here, cpu is 1, so g_cpu_irqset is set to 3 after this call.

(attachment: image.png, showing the spin_setbit() call site)

Masayuki

On Mon, Jan 6, 2020 at 13:43 Masayuki Ishikawa <masayuki...@gmail.com> wrote:

Masayuki Ishikawa

Jan 6, 2020, 12:39:50 AM
to nu...@googlegroups.com
Greg, sorry again.

I remember that this is (perhaps) intended logic, and g_cpu_irqset for this CPU is released in sched_resumescheduler(). So I think this is correct behavior.

Masayuki

On Mon, Jan 6, 2020 at 14:18 Masayuki Ishikawa <masayuki...@gmail.com> wrote:

Gregory Nutt

Jan 6, 2020, 8:49:06 AM
to nu...@googlegroups.com

> I think this is a global IRQ control logic bug?

I am continuing (very slowly) with the design that I posted in the PDF
file a couple of days ago.  My only concern is that it might have
performance implications, since it could potentially involve a lot of
list movement.


Gregory Nutt

Jan 6, 2020, 9:12:03 AM
to nu...@googlegroups.com

> I remember that this is (perhaps) intended logic, and g_cpu_irqset for this CPU is released in sched_resumescheduler(). So I think this is correct behavior.

I think a simple way to duplicate the problem on a processor with 2 CPUs would be:

1. Start high priority Task A on CPU 0 which does:

   g_count = 0;
   nxsem_init(&sem, 0, 0);

   flags = enter_critical_section();
   nxsem_wait(&sem);
   g_count++;
   leave_critical_section(flags);

Task A will then be sleeping and expects to hold the critical section.

2. Start another slightly lower priority Task B on CPU 1 which does:

   flags = enter_critical_section();
   nxsem_post(&sem);
   g_count++;
   leave_critical_section(flags);

Task B will run on CPU 1 and, for a brief moment, both should hold the critical section.  I would expect:

  1. Task A before enter_critical_section():  g_cpu_irqset == 0x00
  2. Task A after enter_critical_section():   g_cpu_irqset == 0x01
  3. After Task A is suspended by nxsem_wait():  g_cpu_irqset == 0x00
  4. Task B before enter_critical_section():  g_cpu_irqset == 0x00
  5. Task B after enter_critical_section():   g_cpu_irqset == 0x02
  6. After nxsem_post() restarts Task A:      g_cpu_irqset == 0x03
  7. Task A after leave_critical_section():   g_cpu_irqset == 0x02
  8. Task B after leave_critical_section():   g_cpu_irqset == 0x00

In this case, Task A and Task B may interfere with each other.  g_count++ is not atomic.  It consists of at least 3 steps:  fetch, add, store.  In this case, they should be happening at about the same time on each CPU, and I could not predict the value of g_count at the end.  It could be 1 or 2, depending on the relative timing of the fetch and store instructions.

Timing of the context switch might be such that they never interfere with each other (and the result would always be 2), but if a longer interfering sequence were there then they should interfere (maybe just incrementing a volatile g_count 100 times).

It would be nice if we could confirm that this problem exists.  No one, other than the author of this post, has ever seen it (although I have suspected issues here for a long time).  So it is possible that there may be some protections, but I just don't know where they are.  SMP is kind of complex.

Greg


Gregory Nutt

Jan 6, 2020, 10:19:49 AM
to nu...@googlegroups.com

> I remember that this is (perhaps) an intended logic and g_cpu_irqset
> on this cpu is released in sched_resumesucheduler(). So I think this
> is a correct behavior.

I am not sure.

Since CPU 0 must hold the critical section to start a task on CPU 1
that requires the critical section, I think that there are
certainly transitional times when the critical section will be held on two
CPUs.  I don't think you can escape that.

But how can you know this is just a transitional state?

Per the previous discussions, the current critical section is a "big
lock": there is one critical section that applies to everything.  You
added a different, mostly redundant lock for access to the task lists:
one for accessing task lists and one for everything else.  I think this
is not cleanly separated now and would be pretty difficult to
detangle.

But the task list lock could be interpreted as just a transient lock, and
the big critical section lock could be interpreted as the more
persistent lock, if we were to properly segregate the use of the locks.

Then you would need only the task list lock to perform context switches,
and the big critical section lock could be controlled perhaps using
logic from the PDF I previously attached?

Greg


Phani Kumar

Jan 6, 2020, 12:31:27 PM
to nu...@googlegroups.com, d...@nuttx.apache.org
Hi,

I am trying to port the USBHOST driver to the RX65N controller. Basically I am using lpc17_40_usbhost.c as a reference (as this contains all the framework functions (xxxx_usbhost_initialize()) and host controller functions (such as xxxx_wait(), xxxx_enumerate(), etc.) in one file). Other references such as imxrt follow the same pattern, but the framework-related code is in boards/.../src/imxrt_usbhost.c and the host-controller-related code is in arch/.../imxrt_ehci.c.
I have some basic doubts, and your thoughts/comments/suggestions would help me.
1. Basically all the references invariably refer to EHCI/OHCI, but the documentation for the USB host in the RX65N controller we are using does not mention anything about EHCI/OHCI. My understanding is that it should not matter, but still: is there anything specific to be configured in NuttX for EHCI/OHCI? The RX65N contains a USB 2.0 compliant, Full Speed USB peripheral.
2. What is IOBUFFERS with respect to USB Host? lpc17_40_usbhost.c refers to LPC17_40_IOBUFFERS (used in some #defines). Is it USB memory dedicated to USB transfers, or can we use normal RAM? What is the specific use of IOBUFFERS and how does it need to be configured?
3. We see xxxx_ioalloc and xxxx_iofree as well as xxxx_alloc and xxxx_free. What is the difference between them? Are xxxx_ioalloc and xxxx_iofree used with IOBUFFERS? If we don't want to use them, is implementing just xxxx_alloc and xxxx_free sufficient?
4. In many places the debug code is wrapped in CONFIG_DEBUG_USB_INFO, which depends on CONFIG_DEBUG_USB, and I could not find where to enable this option (not sure if I missed something in Kconfig).
As I continue, there will possibly be some more doubts/some more clarity :)
With best regards,
Phani.

tugouxp tugouxp

Jan 13, 2020, 8:34:25 PM
to nu...@googlegroups.com
Yes, I would like to help with the testing. How is the progress now? Thank you!

On Sat, Jan 4, 2020 at 11:37 PM patacongo <spud...@gmail.com> wrote:
I am implementing the fix on a fork.  This will probably take one, maybe two days.  I will try to setup my old i.MX6 to test.  I would appreciate if you could help with the testing when the change is ready.


Gregory Nutt

Jan 13, 2020, 8:53:47 PM
to nu...@googlegroups.com
> Yes, I would like to help with the testing. How is the progress now? Thank you!

I have run into some design obstacles and I am not working on it now.

The basic problem with my original idea is that it is necessary to have more than one task hold the critical section at least momentarily:

  • A task on CPU 0 starts a task on CPU 1.  In order to modify the task lists, it must hold the critical section momentarily.
  • The new task started on CPU 1 may also hold the critical section for some other purpose.
  • The task on CPU 0 will immediately release the critical section after the task on CPU 1 is started with no interference.

In this case, it is perfectly normal for both tasks on different CPUs to hold the lock.  But it makes it impossible to distinguish this transient case from the case where both CPUs will hold the lock for a long period of time and interfere with each other.

I think that a correct solution would need to separate that single big lock into two locks: one that behaves as the normal critical section, and one that just protects the task lists with a spinlock.  Masayuki has already implemented this second lock for task lists only, but I don't think it is being used strictly enough at present.

If the two lock uses were separated cleanly, then it would be easy to detect whether starting a new task would interfere because it holds the critical section.  Right now that is not possible.

Changing all of the locking in the sched/ would be a big job and not something that I want to do right now.

Greg


