The log below is from NuttX running in SMP mode on a two-core Cortex-A7. You can notice that every time sched_addreadytorun() is called, g_cpu_irqset is 3.
But from the design principle of this object, it seems that should not happen; two bits should never be set at the same time!
It can happen, but only under one very specific condition. g_cpu_irqset only exists to support this condition:
1. A task running on CPU 0 takes the critical section. So g_cpu_irqset == 0x1.
2. A task exits on CPU 1, and a waiting, ready-to-run task is restarted on CPU 1. This new task also holds the critical section. So when the task is restarted on CPU 1, we then have g_cpu_irqset == 0x3.
So we are in a very perverse state! There are two tasks running on two different CPUs and both hold the critical section. I believe that is a dangerous situation and there could be undiscovered bugs that could happen in that case. However, as of this moment, I have not heard of any specific problems caused by this weird behavior.
From what you are saying, it is causing an assertion to fire. I have not seen that before either.
Greg
It is late here, I will have to look at the code tomorrow to have
anything to say.
This is probably a real problem. The restarted task should really
re-acquire the critical section before it continues.
Another option that we are considering (off-list) is to include
disabling of pre-emption as part of the critical section logic. The
problem should disappear if pre-emption were also disabled. And it
seems to me that this should be integrated into
enter_critical_section() so that it does both.
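To make that concrete, here is a minimal sketch of what such an integration might look like. This is hypothetical, not the current NuttX implementation: only up_irq_save()/up_irq_restore() and sched_lock()/sched_unlock() are real NuttX primitives, and the real critical-section logic also manages g_cpu_irqlock and g_cpu_irqset, which is elided here.

    irqstate_t enter_critical_section(void)
    {
      irqstate_t flags = up_irq_save(); /* Disable interrupts on this CPU */

      /* Hypothetical addition: also disable pre-emption so that no
       * other task can be scheduled while the critical section is
       * held.
       */

      sched_lock();

      /* ... existing g_cpu_irqlock/g_cpu_irqset handling elided ... */

      return flags;
    }

    void leave_critical_section(irqstate_t flags)
    {
      /* ... existing g_cpu_irqlock/g_cpu_irqset handling elided ... */

      sched_unlock();         /* Re-enable pre-emption */
      up_irq_restore(flags);  /* Restore the interrupt state */
    }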
There are some complexities that make this not quite as straight-forward.
A possible solution would be to add a new task state that would exist only for SMP. ...
I am implementing the fix on a fork. This will probably take one, maybe two days. I will try to setup my old i.MX6 to test. I would appreciate if you could help with the testing when the change is ready.
@masayuki You seem to believe that these problems would go away if the support for the ICCMPR register were implemented. Is that true? We should talk more before I go implementing additional OS task state logic.
--- a/sched/sched/sched_addreadytorun.c
+++ b/sched/sched/sched_addreadytorun.c
@@ -335,6 +335,8 @@ bool sched_addreadytorun(FAR struct tcb_s *btcb)
spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock,
&g_cpu_irqlock);
+
+ ASSERT((g_cpu_irqset & (g_cpu_irqset - 1)) == 0);
}
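For reference, the assertion in that patch uses a standard bit trick: x & (x - 1) clears the lowest set bit of x, so the expression is zero exactly when at most one bit is set. A standalone illustration (the helper name is mine, not from the patch):

    #include <stdbool.h>
    #include <stdint.h>

    /* True when at most one bit of cpuset is set, i.e. when at most
     * one CPU claims to hold the critical section.
     */

    static inline bool at_most_one_holder(uint32_t cpuset)
    {
      return (cpuset & (cpuset - 1)) == 0;
    }

So at_most_one_holder(0x1) is true, but at_most_one_holder(0x3) is false, and the assertion fires the moment a second CPU's bit appears in g_cpu_irqset.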
(gdb) where
#0 up_assert (filename=filename@entry=0xd01885a "sched/sched_addreadytorun.c", lineno=lineno@entry=339) at armv7-m/up_assert.c:432
#1 0x0d00789a in sched_addreadytorun (btcb=btcb@entry=0xd026db0) at sched/sched_addreadytorun.c:339
#2 0x0d006cb2 in up_unblock_task (tcb=0xd026db0) at armv7-m/up_unblocktask.c:88
#3 0x0d003a24 in nxsem_post (sem=sem@entry=0xd029114) at semaphore/sem_post.c:166
#4 0x0d004ae8 in nxtask_exitwakeup (status=218271616, tcb=0xd028f80) at task/task_exithook.c:547
#5 nxtask_exithook (tcb=0xd028f80, status=status@entry=0, nonblocking=nonblocking@entry=0 '\000') at task/task_exithook.c:677
#6 0x0d00415c in exit (status=0) at task/exit.c:95
#7 0x0d004132 in nxtask_start () at task/task_start.c:150
#8 0x00000000 in ?? ()
(gdb) up
#1 0x0d00789a in sched_addreadytorun (btcb=btcb@entry=0xd026db0) at sched/sched_addreadytorun.c:339
339 ASSERT((g_cpu_irqset & (g_cpu_irqset - 1)) == 0);
(gdb) p *btcb
$1 = {flink = 0xd022a60 <g_idletcb+380>, blink = 0x0, group = 0xd026f40, pid = 3, start = 0xd004109 <nxtask_start>, entry = {pthread = 0xd00774d <spresense_main>, main = 0xd00774d <spresense_main>}, sched_priority = 100 'd', init_priority = 100 'd', task_state = 4 '\004', cpu = 1 '\001', affinity = 3 '\003', flags = 0, lockcount = 2, irqcount = 1, timeslice = 20, waitdog = 0x0, adj_stack_size = 2028, stack_alloc_ptr = 0xd027220, adj_stack_ptr = 0xd027a08, waitsem = 0x0, sigprocmask = 0, sigwaitmask = 0, sigpendactionq = {head = 0x0, tail = 0x0}, sigpostedq = {head = 0x0, tail = 0x0}, sigunbinfo = {si_signo = 0 '\000', si_code = 0 '\000', si_errno = 0 '\000', si_value = {sival_int = 0, sival_ptr = 0x0}}, msgwaitq = 0x0, pthread_data = {0x0, 0x0, 0x0, 0x0}, pterrno = 0, xcp = {sigdeliver = 0x0, saved_pc = 0, saved_basepri = 0, saved_xpsr = 0, regs = {218265720, 128, 218272020, 218262960, 224, 25, 1, 0, 1, 0, 4294967273, 0 <repeats 16 times>, 2, 218263092, 218271748, 0, 110, 218118601, 218131740, 1090519040, 0 <repeats 17 times>, 218118601}}, name = "init", '\000' <repeats 27 times>}
(gdb)
(gdb) where
#0 up_testset () at armv7-m/gnu/up_testset.S:101
#1 0x0d003a60 in spin_lock (lock=lock@entry=0xd022707 <g_cpu_wait+1> "\001") at semaphore/spinlock.c:89
#2 0x0d001368 in up_cpu_paused (cpu=1) at chip/cxd56_cpupause.c:232
#3 0x0d002e2e in irq_dispatch (irq=irq@entry=112, context=context@entry=0xd026cbc) at irq/irq_dispatch.c:176
#4 0x0d001570 in up_doirq (irq=112, regs=0xd026cbc) at armv7-m/up_doirq.c:86
#5 0x0d0002f8 in exception_common () at armv7-m/gnu/up_exception.S:213
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
857: 0b 02 64 00 03 00 50 05 00 00 06 CPU0 PID 3: SUSPEND
868: 0a 03 00 00 00 00 50 05 00 00 CPU0 PID 0: RESUME
878: 0b 02 00 00 00 00 6b 05 00 00 03 CPU0 PID 0: SUSPEND
889: 0a 03 64 00 03 00 6b 05 00 00 CPU0 PID 3: RESUME
899: 0b 02 64 00 03 00 6b 05 00 00 06 CPU0 PID 3: SUSPEND
910: 0a 03 00 00 00 00 6b 05 00 00 CPU0 PID 0: RESUME
920: 0b 02 00 00 00 00 9d 05 00 00 03 CPU0 PID 0: SUSPEND
931: 0a 03 64 00 03 00 9d 05 00 00 CPU0 PID 3: RESUME
941: 10 00 64 00 05 00 9d 05 00 00 68 65 6c 6c 6f 00 CPU0 PID 5: START
957: 0b 02 64 00 03 00 9d 05 00 00 06 CPU0 PID 3: SUSPEND
968: 0a 03 64 00 05 00 9d 05 00 00 CPU0 PID 5: RESUME
978: 0b 06 64 00 05 00 9d 05 00 00 01 CPU0 PID 5: CPU_PAUSE
989: 0b 02 00 01 01 00 9d 05 00 00 04 CPU1 PID 1: SUSPEND
1000: 0a 07 00 01 01 00 9d 05 00 00 CPU1 PID 1: CPU_PAUSED

I remember that this is (perhaps) intended logic, and g_cpu_irqset on this CPU is released in sched_resume_scheduler(). So I think this is correct behavior.
I think a simple way to duplicate the problem on a processor with 2 CPUs would be something like this:
1. Start high priority Task A on CPU 0 which does:
g_count = 0;
nxsem_init(&sem, 0, 0);
flags = enter_critical_section();
nxsem_wait(&sem);
g_count++;
leave_critical_section(flags);
Task A will then be sleeping and expects to hold the critical
section.
2. Start another slightly lower priority Task B on CPU 1 which does:
flags = enter_critical_section();
nxsem_post(&sem);
g_count++;
leave_critical_section(flags);
Task B will run on CPU 1 and for a brief moment, both should hold
the critical section. I would expect:
In this case, Task A and Task B may interfere with each other. g_count++ is not atomic. It consists of at least 3 steps: fetch, add, store. Here they should be happening at about the same time on each CPU, and I could not predict the value of g_count at the end. It could be 1 or 2, depending on the relative timing of the fetch and store instructions.
Timing of the context switch might be such that they never
interfere with each other (and the result would always be 2), but
if a longer interfering sequence were there then they should
interfere (maybe just incrementing a volatile g_count 100 times).
A self-contained sketch of the scenario follows below.
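Here is that sketch, assuming a NuttX SMP build with at least two CPUs. The task bodies, priorities, stack sizes, and the entry point repro_main() are all hypothetical; only the NuttX primitives (task_create(), nxsem_init()/nxsem_wait()/nxsem_post(), enter_critical_section()/leave_critical_section()) are real.

    #include <nuttx/config.h>
    #include <nuttx/irq.h>
    #include <nuttx/semaphore.h>
    #include <sched.h>
    #include <stddef.h>

    static sem_t g_sem;
    static volatile int g_count;

    static int task_a(int argc, char *argv[])
    {
      irqstate_t flags = enter_critical_section();
      nxsem_wait(&g_sem);  /* Sleep while holding the critical section */
      g_count++;           /* Non-atomic: fetch, add, store */
      leave_critical_section(flags);
      return 0;
    }

    static int task_b(int argc, char *argv[])
    {
      irqstate_t flags = enter_critical_section();
      nxsem_post(&g_sem);  /* Restarts Task A on the other CPU */
      g_count++;           /* May interleave with Task A's increment */
      leave_critical_section(flags);
      return 0;
    }

    int repro_main(int argc, char *argv[])  /* Hypothetical entry point */
    {
      g_count = 0;
      nxsem_init(&g_sem, 0, 0);
      task_create("A", 100, 2048, task_a, NULL); /* Higher priority */
      task_create("B", 99, 2048, task_b, NULL);  /* Slightly lower */
      return 0;
    }

If the two increments interleave (fetch/fetch/add/add/store/store), g_count ends up 1 instead of 2.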
It would be nice if we could confirm that this problem exists.
No one, other than the author of this post, has ever seen it
(although I have suspected issues here for a long time). So it is
possible that there may be some protections, but I just don't know
where they are. SMP is kind of complex.
Greg
Yes, I would like to help do the testing. How is the progress now? Thank you!
I have run into some design obstacles and I am not working on it now.
The basic problem with my original idea is that it is necessary to have more than one task hold the critical section at least momentarily:
In this case, it is perfectly normal for both tasks on different CPUs to hold the lock. But that makes it impossible to distinguish this normal, momentary hand-off from the case where both CPUs hold the lock for a long period of time and interfere with each other.
I think that a correct solution would need to separate that single big lock into two locks: one that behaves as the normal critical section, and one that just protects the task lists with a spinlock. Masayuki has already implemented this second lock for task lists only, but I don't think it is being used strictly enough at present.
If the two lock uses were separated cleanly, then it would be easy to detect whether starting a new task would interfere because it holds the critical section. Right now that is not possible. A rough sketch of the separation follows below.
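This is a hypothetical sketch, just to make the idea concrete. The names g_tasklist_lock, tasklist_lock(), and tasklist_unlock() are mine; spin_lock()/spin_unlock(), spinlock_t, and SP_UNLOCKED are the real NuttX spinlock primitives.

    #include <nuttx/irq.h>
    #include <nuttx/spinlock.h>

    static spinlock_t g_tasklist_lock = SP_UNLOCKED;

    /* Protect only the task lists; separate from the critical section */

    static inline irqstate_t tasklist_lock(void)
    {
      irqstate_t flags = up_irq_save();  /* No interrupts while held */
      spin_lock(&g_tasklist_lock);
      return flags;
    }

    static inline void tasklist_unlock(irqstate_t flags)
    {
      spin_unlock(&g_tasklist_lock);
      up_irq_restore(flags);
    }

With every task-list access guarded this way, enter_critical_section() would be free to track critical-section ownership on its own, and the scheduler could detect when restarting a task that holds the critical section would interfere with a holder on another CPU.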
Changing all of the locking in sched/ would be a big job and not something that I want to do right now.
Greg