This happened when:1. Timer callback timer_timeout() called not in an IRQ2. Thread call timer_timeout() --> timer_signotify() --> switch to high priority thread --> timer_delete()
static void timer_timeout(int argc, wdparm_t itimer)
{
FAR struct posix_timer_s *timer = (FAR struct posix_timer_s *)itimer;
if (!up_interrupt_context())
{
syslog(LOG_INFO, "WARNING, %s, NOT IRQ, line %d\n", __func__, __LINE__);
}
The pink line is mine added log info, and it print out. That says we are not in IRQ.
This does not make sense. The timer uses a wdog for timeout and, as the comments for the function say, that function should ONLY be called from the timer interrupt handler. It must always be in the interrupt handler context or else something is very wrong.
I think you first need to understand the root cause of the problem before proposing a solution. timer_timeout() must only run from the timer interrupt handler and sched_lock() makes no sense.
Another warning: sched_lock() would not be the correct interface to use if you are in an interrupt handler OR even if you are not in an interrupt handler but you want to establish a critical section. The function of sched_lock() is to prevent your thread from being suspended so that it will keep running until sched_unlock().
If there is only a single CPU, then sched_lock() does behave kind of like a critical section. But in SMP mode with multiple CPUs it does not work that way: The thread will still keep running while the scheduler is locked, however, it will not effect tasks running on the other CPUs.
Greg
Hi, Guiding,
...This happened when:1. Timer callback timer_timeout() called not in an IRQ2. Thread call timer_timeout() --> timer_signotify() --> switch to high priority thread --> timer_delete()
static void timer_timeout(int argc, wdparm_t itimer)
{
FAR struct posix_timer_s *timer = (FAR struct posix_timer_s *)itimer;
if (!up_interrupt_context())
{
syslog(LOG_INFO, "WARNING, %s, NOT IRQ, line %d\n", __func__, __LINE__);
}
The pink line is mine added log info, and it print out. That says we are not in IRQ.
This does not make since. The timer uses a wdog for timeout and, as the comments for the function say, that function should ONLY be called from the timer interrupt handler. It must always be in the interrupt handler context or else something is very wrong.
I think you first need to understand the root cause of the problem before proposing a solution. timer_timeout() must only run from the timer interrupt handler and sched_lock() makes no sense.
Another warning: sched_lock() would not be the correct interface to use if you are in an interrupt handler OR even if you are not in an interrupt handler but you want to establish a critical section. The function of sched_lock() is to prevent your thread from being suspended so that it will keep running until sched_unlock().
If there is only a single CPU, then sched_lock() does behave kind of like a critical section. But in SMP mode with multiple CPUs it does not work that way: The thread will still keep running while the scheduler is locked, however, it will not effect tasks running on the other CPUs.
Greg
To unsubscribe from this group and stop receiving emails from it, send an email to nu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nuttx/25a33590-77ab-460d-8895-620ead5a26d8%40googlegroups.com.
3. the root cause I think is:Task A: wd_start() (enter_critical_section) -> sched_timer_cancel() -> nxsched_timer_process() -> wd_timer() -> wd_expiration() -> timer_timeout() -> timer_signotify() -> up_unblock_task(Task B)
Task B: timer_delete() return success, but actually NOT delete timer, Task B pending
Task A: up_unblock_task() return -> timer_timeout() continue ringing
I will look into this more when I have a chance. Today looks like it is going to be a busy day.
Task A: wd_start() (enter_critical_section) -> sched_timer_cancel() -> nxsched_timer_process() -> wd_timer() -> wd_expiration() -> timer_timeout() -> timer_signotify() -> up_unblock_task(Task B)Task B: timer_delete() return success, but actually NOT delete timer, Task B pendingTask A: up_unblock_task() return -> timer_timeout() continue ringing
142 static void timer_timeout(int argc, wdparm_t itimer)
143 {
158 /* Send the specified signal to the specified task. Increment the
159 * reference count on the timer first so that will not be deleted until
160 * after the signal handler returns.
161 */
162
163 u.timer->pt_crefs++;
164 timer_signotify(u.timer);
Task B wakes up here and calls timer_delete():
sched/timer/timer_delete.c:
78 int timer_delete(timer_t timerid)
79 {
80 int ret = timer_release((FAR struct posix_timer_s *)timerid);
sched/wdog/wd_start.c:
166 /* Release the reference. timer_release will return nonzero if the timer
167 * was not deleted.
168 */
169
170 if (timer_release(u.timer))
171 {
172 /* If this is a repetitive timer, then restart the watchdog */
173
174 timer_restart(u.timer, itimer);
175 }
So something in that sequence is not working. Perhaps the reference count is not be respected in timer_release() or timer_delete(). That seems to be the only way that the scenario you describe could occur.
Perhaps you could add instrumentation that tracks the wdog reference count from allocation until when it is deleted. That should tell you exactly what is happening.
Greg
No, that is not the problem. Another way to handle the stale pointer problem is via reference counts and there is already a reference count on the timer structure.sched/wdog/wd_start.c:142 static void timer_timeout(int argc, wdparm_t itimer)
143 {
158 /* Send the specified signal to the specified task. Increment the
159 * reference count on the timer first so that will not be deleted until
160 * after the signal handler returns.
161 */
162
163 u.timer->pt_crefs++;
164 timer_signotify(u.timer);The timer reference count was probably one on entry into timer_timeout, but was incremented to two before signaling task B.Task B wakes up here and calls timer_delete():
sched/timer/timer_delete.c:
78 int timer_delete(timer_t timerid)
79 {
80 int ret = timer_release((FAR struct posix_timer_s *)timerid);timer_delete() must respect the reference count. It must not actually delete the timer here. If everything is working correctly, the reference count should be one when Task A enters timer_timeout(). It should be two when timer_signotify() is called and when Task B runs. The call to timer_release() in Task B should do nothing but decrement the reference back to one.
79 int timer_delete(timer_t timerid)
80 {
81 int ret = timer_release((FAR struct posix_timer_s *)timerid);
82 if (ret < 0)
83 {
84 set_errno(-ret);
85 return ERROR;
86 }
87
88 if (ret == 1)
89 {
90 syslog(LOG_INFO, "WARNING, %s, line %d\n", __func__, __LINE__);
91 }
92
93 return OK;
94 }
>
> The pink line, print log when meet error.
Task B has crefs 2, so timer_release return value 1 and set crefs down to 1.
The point it timer_delete() ignore the timer_release() return 1, then just return OK. So user think timer_delete() has already return OK, then the timer must be deleted.
But later the "user-think-deleted-timer" ringing, receive the timer callback.
The simplest resolution is:Let timer_delete() return -EAGAIN when timer_release() return 1. Let user know timer_delete() failed.
But this resolution don't respect your original design: timer_settime() must be called in IRQ handler.If we can obey the rules: "timer_settime() must be called in IRQ handler.", then this problem will not happen.
Step 1: Priority 100 Task A: wd_start() -> ... -> timer_timeout() -> crefs++(2) -> timer_signotify() -> signal NOTIFY-work to TASK B -> up_unblock_task(TASK B) pink line timer_timeout printf out
Step 2: Priority 120 Task B: prepare to do NOTIFY work, but interrupt by Task C.
Step 3: Priority 200 Task C: timer_delete() return OK, crefs-- (1), but actually NOT delete timer, haven't free the real timer pink line timer_delete printf out
That should not be a problem. The caller of timer_delete() does not care if the memory is deleted or not, only that the timer is fully disabled. Note that timer_delete() calls timer_event() which disables the wdog timer:
137 /* Free the underlying watchdog instance (the timer will be
canceled by the
138 * watchdog logic before it is actually deleted)
139 */
140
141 (void)wd_delete(timer->pt_wdog);
So if the timer is deleted BEFORE the notification, then timer_timeout() will not run because the timeout has been canceled. If timer_timeout () is blocked in the notification when the timer is deleted, it will catch this when it calls timer_release() after resuming.
Step 4: Priority 120 Task B: do NOTIFY work, user callback called, user see timer ringring.
Step 5: Priority 100 Task A: up_unblock_task() return -> timer_signotify() return -> timer_release() called crefs-- (0).
And this should prevent timer_timeout() from restarting the
timer. Everything looks good to me.
I don't think that that the above scenario could ever happen. We would need more data to understand the root cause of the problem. We should not discuss solutions until the root cause of the problem is understood. Just changing code without a full understanding would just be putting a bandage on the software.
Greg
To unsubscribe from this group and stop receiving emails from it, send an email to nu...@googlegroups.com.
All my confuse about timer is just: additional notifications after calling the timer_delete().
If you think:A few additional notifications after disableing the timer is a possibility in normal operation.Then that's OK. ^_^
But however, there is a risk on "timer_timeout() does run in a Task contexts".In one big & complex system, this maybe caused some very bad behavior.So we need find a way to make the "timer_timeout() must be called in IRQ hanlder".
But however, there is a risk on "timer_timeout() does run in a Task contexts".In one big & complex system, this maybe caused some very bad behavior.So we need find a way to make the "timer_timeout() must be called in IRQ hanlder".
Sent from Samsung tablet.--But however, there is a risk on "timer_timeout() does run in a Task contexts".In one big & complex system, this maybe caused some very bad behavior.So we need find a way to make the "timer_timeout() must be called in IRQ hanlder".I don't think that is necessarily true. I am not aware of any problems in the design. There is no requirement that timer timeout run in a task context, provided that it runs within a critical section so that interrupts are disabled just like in an interrupt handler. Why do you think that must happen?Is your concern that timer_timeout() could be suspended, then re-entered with a real timer interrupt? I don't think that could happen in tickless mode. Has the interval timer been restarted? I would have check but I don't think so.It certainly could not be re-entered for this same timer because it has not yet been restarted at the time of the possible suspension.Greg
You received this message because you are subscribed to the Google Groups "NuttX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nu...@googlegroups.com.
No, have no strong enough proof to support this risk. Currently.I will report it you if I meet likely problem.Just keep in mind now.
Another warning: sched_lock() would not be the correct interface to use if you are in an interrupt handler OR even if you are not in an interrupt handler but you want to establish a critical section. The function of sched_lock() is to prevent your thread from being suspended so that it will keep running until sched_unlock().
If there is only a single CPU, then sched_lock() does behave kind of like a critical section. But in SMP mode with multiple CPUs it does not work that way: The thread will still keep running while the scheduler is locked, however, it will not effect tasks running on the other CPUs.
I believe that that is incorrect. When you start a new timer, there are two possibilities:
1. The current interval delay is shorter than the new timer delay. In this case, nothing really has to be done. The next timer expiration does not change. What you suggest above works in that case.
2. The current interval delay is longer than the new timer delay. In this case, we must a) cancel the current interval timer, b) update the time for the elapsed time on the canceled timer, then c) restart the interval timer with shorter delay. That is what wd_expiration() does that and that is why it must be called.
Normally, calling wd_expiration() has not unusual side effects. There is only one case: If the interval timer has already expired. Since interrupts are disabled in the timer code, that interrupt will pend. In this case, calling wd_expiration() can cause a timer to expire in 2b. But this is all good and normal. I would not change it.
Greg