sched/timer: wdog timer already deleted by timer_delete() successful but still ringing

95 views
Skip to first unread message

Guiding Li

unread,
Aug 27, 2019, 8:38:14 AM8/27/19
to NuttX

Hi Greg,

1. Create timer by timer_create() and call timer_settime() to setup timer
2. Call timer_delete() and return success
3. Still meet timer_callbak ringing

This happened when:
1. Timer callback timer_timeout() called not in an IRQ
2. Thread call timer_timeout() --> timer_signotify() --> switch to high priority thread --> timer_delete()

static void timer_timeout(int argc, wdparm_t itimer)
{
  FAR struct posix_timer_s *timer = (FAR struct posix_timer_s *)itimer;

  if (!up_interrupt_context())
    {  
      syslog(LOG_INFO, "WARNING, %s, NOT IRQ, line %d\n",  __func__, __LINE__);
    }  


  /* Send the specified signal to the specified task.   Increment the
   * reference count on the timer first so that will not be deleted until
   * after the signal handler returns.
   */

  timer->pt_crefs++;
  timer_signotify(timer);    // switch to another thread and do timer_delte()

  /* Release the reference.  timer_release will return nonzero if the timer
   * was not deleted.
   */

  if (timer_release(timer))
    {  
      /* If this is a repetitive timer, the restart the watchdog */

      timer_restart(timer, itimer);
    } 
}

The pink line is mine added log info, and it print out. That says we are not in IRQ.


Resolve:
Should we fix this question by following, or you have more advise ?

static void timer_timeout(int argc, wdparm_t itimer)
{
  FAR struct posix_timer_s *timer = (FAR struct posix_timer_s *)itimer;

  sched_lock();

  /* Send the specified signal to the specified task.   Increment the
   * reference count on the timer first so that will not be deleted until
   * after the signal handler returns.
   */

  timer->pt_crefs++;
  timer_signotify(timer);

  /* Release the reference.  timer_release will return nonzero if the timer
   * was not deleted.
   */

  if (timer_release(timer))
    {  
      /* If this is a repetitive timer, the restart the watchdog */

      timer_restart(timer, itimer);
    } 

  sched_unlock();
}

Guiding

Gregory Nutt

unread,
Aug 27, 2019, 10:09:46 AM8/27/19
to nu...@googlegroups.com
Hi, Guiding,
This happened when:
1. Timer callback timer_timeout() called not in an IRQ
2. Thread call timer_timeout() --> timer_signotify() --> switch to high priority thread --> timer_delete()

static void timer_timeout(int argc, wdparm_t itimer)
{
  FAR struct posix_timer_s *timer = (FAR struct posix_timer_s *)itimer;

  if (!up_interrupt_context())
    {  
      syslog(LOG_INFO, "WARNING, %s, NOT IRQ, line %d\n",  __func__, __LINE__);
    }  

...

The pink line is mine added log info, and it print out. That says we are not in IRQ.

This does not make sense.  The timer uses a wdog for timeout and, as the comments for the function say, that function should ONLY be called from the timer interrupt handler.  It must always be in the interrupt handler context or else something is very wrong.

I think you first need to understand the root cause of the problem before proposing a solution.  timer_timeout() must only run from the timer interrupt handler and sched_lock() makes no sense.


Another warning:  sched_lock() would not be the correct interface to use if you are in an interrupt handler OR even if you are not in an interrupt handler but you want to establish a critical section.  The function of sched_lock() is to prevent your thread from being suspended so that it will keep running until sched_unlock().

If there is only a single CPU, then sched_lock() does behave kind of like a critical section.  But in SMP mode with multiple CPUs it does not work that way:  The thread will still keep running while the scheduler is locked, however, it will not effect tasks running on the other CPUs.

Greg


Guiding Li

unread,
Aug 27, 2019, 11:56:45 PM8/27/19
to NuttX


在 2019年8月27日星期二 UTC+8下午10:09:46,patacongo写道:
Hi, Guiding,
This happened when:
1. Timer callback timer_timeout() called not in an IRQ
2. Thread call timer_timeout() --> timer_signotify() --> switch to high priority thread --> timer_delete()

static void timer_timeout(int argc, wdparm_t itimer)
{
  FAR struct posix_timer_s *timer = (FAR struct posix_timer_s *)itimer;

  if (!up_interrupt_context())
    {  
      syslog(LOG_INFO, "WARNING, %s, NOT IRQ, line %d\n",  __func__, __LINE__);
    }  

...
The pink line is mine added log info, and it print out. That says we are not in IRQ.

This does not make since.  The timer uses a wdog for timeout and, as the comments for the function say, that function should ONLY be called from the timer interrupt handler.  It must always be in the interrupt handler context or else something is very wrong.

I think you first need to understand the root cause of the problem before proposing a solution.  timer_timeout() must only run from the timer interrupt handler and sched_lock() makes no sense.


Here is calling stack:
wd_start() -> sched_timer_cancel() -> nxsched_timer_process() -> wd_timer() -> wd_expiration() -> timer_timeout()
So, there was one opportunity to execute timer_timeout() in THREAD.

I have reported this situation before:


Another warning:  sched_lock() would not be the correct interface to use if you are in an interrupt handler OR even if you are not in an interrupt handler but you want to establish a critical section.  The function of sched_lock() is to prevent your thread from being suspended so that it will keep running until sched_unlock().

If there is only a single CPU, then sched_lock() does behave kind of like a critical section.  But in SMP mode with multiple CPUs it does not work that way:  The thread will still keep running while the scheduler is locked, however, it will not effect tasks running on the other CPUs.

Greg



Ok, got your point, use enter_critical_section NOT sched_lock.
But if we can cut off the route which timer_timeout called by THREAD, then we will do nothing to function timer_timeout().

Guiding

spudaneco

unread,
Aug 28, 2019, 12:12:03 AM8/28/19
to nu...@googlegroups.com
I will need to study this in the morning.  In the calling sequence you show, it seems like the critical section should go higher up.  All logic in nxsched_timer_process() shoul run in a critical section.

Sent from Samsung tablet.
--
You received this message because you are subscribed to the Google Groups "NuttX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nuttx+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nuttx/89d4f538-5924-4aa9-a640-bb8fec06cc46%40googlegroups.com.

Guiding Li

unread,
Aug 28, 2019, 12:30:52 AM8/28/19
to NuttX
I suppose we should better cut off the rote:
sched_timer_cancel() -> nxsched_timer_process()
sched_timer_resume() -> nxsched_timer_process()

and make sure nxsched_timer_process() ONLY called by IRQ handler.

We will discuss this later. Good night.

在 2019年8月28日星期三 UTC+8下午12:12:03,patacongo写道:
To unsubscribe from this group and stop receiving emails from it, send an email to nu...@googlegroups.com.

spudaneco

unread,
Aug 28, 2019, 12:51:04 AM8/28/19
to nu...@googlegroups.com
I think it just needs a critical section around the call to nxsched_tmer_process in both cases to prevent the timer interrupt from interfering.

Greg
To unsubscribe from this group and stop receiving emails from it, send an email to nuttx+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nuttx/25a33590-77ab-460d-8895-620ead5a26d8%40googlegroups.com.

Guiding Li

unread,
Aug 28, 2019, 3:10:45 AM8/28/19
to NuttX

You mean modify like this ?
sched_timer_cancel()
{
  xxx...
  enter_critical_section();
  nxsched_tmer_process();
  leave_critical_section();
  xxx...
}

This can NOT work for our situation.
wd_start() -> sched_timer_cancel() -> nxsched_timer_process() -> wd_timer() -> wd_expiration() -> timer_timeout()

1. wd_timer() still called from NON-interrupt handler
2. wd_start() already in critical section, but this problem still happens
3. the root cause I think is:
    Task A: wd_start() (enter_critical_section) -> sched_timer_cancel() -> nxsched_timer_process() -> wd_timer() -> wd_expiration() -> timer_timeout() -> timer_signotify() -> up_unblock_task(Task B)
    Task B: timer_delete() return success, but actually NOT delete timer, Task B pending
    Task A: up_unblock_task() return -> timer_timeout() continue ringing

在 2019年8月28日星期三 UTC+8下午12:51:04,patacongo写道:

Gregory Nutt

unread,
Aug 28, 2019, 10:15:20 AM8/28/19
to nu...@googlegroups.com

3. the root cause I think is:
    Task A: wd_start() (enter_critical_section) -> sched_timer_cancel() -> nxsched_timer_process() -> wd_timer() -> wd_expiration() -> timer_timeout() -> timer_signotify() -> up_unblock_task(Task B)
    Task B: timer_delete() return success, but actually NOT delete timer, Task B pending
    Task A: up_unblock_task() return -> timer_timeout() continue ringing

I will look into this more when I have a chance.  Today looks like it is going to be a busy day.


patacongo

unread,
Aug 28, 2019, 4:39:22 PM8/28/19
to NuttX

    Task A: wd_start() (enter_critical_section) -> sched_timer_cancel() -> nxsched_timer_process() -> wd_timer() -> wd_expiration() -> timer_timeout() -> timer_signotify() -> up_unblock_task(Task B)
    Task B: timer_delete() return success, but actually NOT delete timer, Task B pending
    Task A: up_unblock_task() return -> timer_timeout() continue ringing

Okay, I think I understand the problem.  The logic running on Task A will signal Task B:

Task A: wd_start() (enter_critical_section) -> sched_timer_cancel() -> nxsched_timer_process() -> wd_timer() -> wd_expiration() -> timer_timeout() -> timer_signotify() -> up_unblock_task(Task B)

sched/wdog/wd_start.c:
142 static void timer_timeout(int argc, wdparm_t itimer)
143 {
158   /* Send the specified signal to the specified task.   Increment the
159    * reference count on the timer first so that will not be deleted until
160    * after the signal handler returns.
161    */
162
163   u.timer->pt_crefs++;
164   timer_signotify(u.timer);

Task B wakes up here and calls timer_delete():

sched/timer/timer_delete.c:
78 int timer_delete(timer_t timerid)
 79 {
 80   int ret = timer_release((FAR struct posix_timer_s *)timerid);
 
The timer is now deleted.  Any existing reference to the timer is now a stale pointer, that is, it points to memory that has be de-allocated.  When Task A resumes in timer_timout(), it will use a stale pointer to restart the timer if it is periodic:

 sched/wdog/wd_start.c:
166   /* Release the reference.  timer_release will return nonzero if the timer
167    * was not deleted.
168    */
169
170   if (timer_release(u.timer))
171     {
172       /* If this is a repetitive timer, then restart the watchdog */
173
174       timer_restart(u.timer, itimer);
175     }

So the most direct and probably the best solution would be to detect the stale handle in timer_timeout().  I think that would be the better solution because it is at the point of the failure and cannot cause any other changes in system behavior.

The usual way of detecting a stale pointer would be to put a unique ID value in the wdog structure.  This unique ID is generated from a counter when the wdog is allocated and set to zero when the wdog is freed.  Obviously, the ID should never be set to zero.

Before calling timer_signotify() , timer_timeout(() should save the unique value of the wdog.  On return, before continuing operation, timer_timeout() should verify that the ID has not changed.  Any different value means that the pointer is not valid.

Let me think if there is a better way to detect a stale wdog structure.  But that is the way I have done this often in the past.

patacongo

unread,
Aug 28, 2019, 5:24:15 PM8/28/19
to NuttX
No, that is not the problem.  Another way to handle the stale pointer problem is via reference counts and there is already a reference count on the timer structure.

sched/wdog/wd_start.c:

142 static void timer_timeout(int argc, wdparm_t itimer)
143 {
158   /* Send the specified signal to the specified task.   Increment the
159    * reference count on the timer first so that will not be deleted until
160    * after the signal handler returns.

161    */
162
163   u.timer->pt_crefs++;
164   timer_signotify(u.timer);

The timer reference count was probably one on entry into timer_timeout, but was incremented to two before signaling task B.

Task B wakes up here and calls timer_delete():

sched/timer/timer_delete.c:
78 int timer_delete(timer_t timerid)
 79 {
 80   int ret = timer_release((FAR struct posix_timer_s *)timerid);

timer_delete() must respect the reference count.  It must not actually delete the timer here.  If everything is working correctly, the reference count should be one when Task A enters timer_timeout().  It should be two when timer_signotify() is called and when Task B runs.  The call to timer_release() in Task B should do nothing but decrement the reference back to one.

So when control comes back to timer_timeout(), the reference count should again be one.
 
 sched/wdog/wd_start.c:
166   /* Release the reference.  timer_release will return nonzero if the timer
167    * was not deleted.

168    */
169
170   if (timer_release(u.timer))
171     {
172       /* If this is a repetitive timer, then restart the watchdog */
173
174       timer_restart(u.timer, itimer);
175     }

Since the reference count is one this call to timer_release WILL delete the timer.  timer_release() will return zero and that should be the end of the time.

So something in that sequence is not working.  Perhaps the reference count is not be respected in timer_release() or timer_delete().  That seems to be the only way that the scenario you describe could occur.

Greg

Gregory Nutt

unread,
Aug 28, 2019, 6:30:58 PM8/28/19
to nu...@googlegroups.com

 

So something in that sequence is not working.  Perhaps the reference count is not be respected in timer_release() or timer_delete().  That seems to be the only way that the scenario you describe could occur.

 

Perhaps you could add instrumentation that tracks the wdog reference count from allocation until when it is deleted.  That should tell you exactly what is happening.

 

Greg

Guiding Li

unread,
Aug 28, 2019, 11:10:22 PM8/28/19
to NuttX


在 2019年8月29日星期四 UTC+8上午5:24:15,patacongo写道:
No, that is not the problem.  Another way to handle the stale pointer problem is via reference counts and there is already a reference count on the timer structure.

sched/wdog/wd_start.c:

142 static void timer_timeout(int argc, wdparm_t itimer)
143 {
158   /* Send the specified signal to the specified task.   Increment the
159    * reference count on the timer first so that will not be deleted until
160    * after the signal handler returns.

161    */
162
163   u.timer->pt_crefs++;
164   timer_signotify(u.timer);

The timer reference count was probably one on entry into timer_timeout, but was incremented to two before signaling task B.

Task B wakes up here and calls timer_delete():

sched/timer/timer_delete.c:
78 int timer_delete(timer_t timerid)
 79 {
 80   int ret = timer_release((FAR struct posix_timer_s *)timerid);

timer_delete() must respect the reference count.  It must not actually delete the timer here.  If everything is working correctly, the reference count should be one when Task A enters timer_timeout().  It should be two when timer_signotify() is called and when Task B runs.  The call to timer_release() in Task B should do nothing but decrement the reference back to one.

 79 int timer_delete(timer_t timerid)
 80 {
 81   int ret = timer_release((FAR struct posix_timer_s *)timerid);
 82   if (ret < 0)
 83     {
 84       set_errno(-ret);
 85       return ERROR;
 86     }
 87
 88   if (ret == 1)
 89     {
 90       syslog(LOG_INFO, "WARNING, %s, line %d\n",  __func__, __LINE__);
 91     }

 92
 93   return OK;
 94 }

The pink line, print log when meet error.
Task B has crefs 2, so timer_release return value 1 and set crefs down to 1.
The point it timer_delete() ignore the timer_release() return 1, then just return OK. So user think timer_delete() has already return OK, then the timer must be deleted.
But later the "user-think-deleted-timer" ringing, receive the timer callback.

The simplest resolution is:
Let timer_delete() return -EAGAIN when timer_release() return 1. Let user know timer_delete() failed.

But this resolution don't respect your original design: timer_settime() must be called in IRQ handler.
If we can obey the rules: "timer_settime() must be called in IRQ handler.", then this problem will not happen.

patacongo

unread,
Aug 29, 2019, 12:09:44 AM8/29/19
to nu...@googlegroups.com

79 int timer_delete(timer_t timerid)
80 {
81   int ret = timer_release((FAR struct posix_timer_s *)timerid);
82   if (ret < 0)
83     {
84       set_errno(-ret);
85       return ERROR;
86     }
87
88   if (ret == 1)
89     {
90       syslog(LOG_INFO, "WARNING, %s, line %d\n",  __func__, __LINE__);
91     }

92
93   return OK;
94 }
>
> The pink line, print log when meet error.

The return value of 1 is not an error.  It is success.  All errors are represented by negative numbers, all successes are represented by zero or positive numbers.

Returning zero (OK) is the correct thing to.  From the stand point of the caller, the timer is deleted -- Deleted meaning that the reference held by the caller has been subtracted.

The reference count allows the data allocation to persist until it is no longer needed by the OS.  This is all correct.

Task B has crefs 2, so timer_release return value 1 and set crefs down to 1.

Correct behavior
 
The point it timer_delete() ignore the timer_release() return 1, then just return OK. So user think timer_delete() has already return OK, then the timer must be deleted.

 OK is the correct return value.  It means that the caller no longer has a reference to the timer.

But later the "user-think-deleted-timer" ringing, receive the timer callback.

 I don't see how that can happen.  When it returns to timer_timeout(), the reference count will be decremeted to zero and the timer memory will try be freed.  That is how it is supposed to work.  The reference count controls the life of the memory allocation.  The memory will persist until the reference count decrements to zero.
sched/wdog/wd_start.c:

142 static void timer_timeout(int argc, wdparm_t itimer)
143 {
158   /* Send the specified signal to the specified task.   Increment the
159    * reference count on the timer first so that will not be deleted until
160    * after the signal handler returns.
161    */
162
163   u.timer->pt_crefs++;      <<< Reference count becomes 2
164   timer_signotify(u.timer);
265                             <<< On return reference count goes to one.

166   /* Release the reference.  timer_release will return nonzero if the timer
167    * was not deleted.
168    */
169
170   if (timer_release(u.timer)) <<< Reference count goes to zero, timer memory deleted
171     {  <<< timer_release() returns 0 so this logic does not execute

172       /* If this is a repetitive timer, then restart the watchdog */
173
174       timer_restart(u.timer, itimer);
175     }
The simplest resolution is:
Let timer_delete() return -EAGAIN when timer_release() return 1. Let user know timer_delete() failed.

No, that would be incorrect.  From the standpoint of the caller, the timer is deleted (but the memory will persist for a little while).  That return value is also not permitted under POSIX:  http://pubs.opengroup.org/onlinepubs/009695399/functions/timer_delete.html
 
But this resolution don't respect your original design: timer_settime() must be called in IRQ handler.
If we can obey the rules: "timer_settime() must be called in IRQ handler.", then this problem will not happen.

That is not correct. timer_settime() must NEVER be called from an interrupt handler.  It a POSIX user function not permitted in interrupt handers:  http://pubs.opengroup.org/onlinepubs/009695399/functions/timer_settime.html

Perhaps you meant some other function?

I see nothing wrong this logic.  Nor has this problem ever been reported by anyone in the last dozen years.  I believe that you have some else wrong in you implementation.  This is not the cause of your problem.

If there were an error in the reference counting, or if the wdog structure were corrupted somehow by other logic, then this symptom could occur.  But it is not caused by the logic that we are talking about.  The reference counting should prevent the scenario you talk about from ever happening.

Greg

Guiding Li

unread,
Aug 29, 2019, 6:42:57 AM8/29/19
to NuttX

More details:

179   FAR struct posix_timer_s *timer = (FAR struct posix_timer_s *)itimer;
180
181   if (!up_interrupt_context())
182     {
183       printf("WARNING, %s, NOT IRQ, pid %d, line %d\n",  __func__, getpid(), __LINE__);
184     }

185
186   /* Send the specified signal to the specified task.   Increment the
187    * reference count on the timer first so that will not be deleted until
188    * after the signal handler returns.
189    */
190
191   timer->pt_crefs++;
192   timer_signotify(timer);
193      
194   if (!up_interrupt_context())      //add this for easily reproduce
195     {
196       usleep(10);
197     }

198    
199   /* Release the reference.  timer_release will return nonzero if the timer
200    * was not deleted.
201    */
202
203   if (timer_release(timer))
204     {
205       /* If this is a repetitive timer, the restart the watchdog */
206      
207       timer_restart(timer, itimer);
208     }

 79 int timer_delete(timer_t timerid)
 80 {
 81   int ret = timer_release((FAR struct posix_timer_s *)timerid);
 82   if (ret < 0)
 83     {
 84       set_errno(-ret);
 85       return ERROR;
 86     }
 87
 88   if (ret == 1)
 89     {
 90       printf("WARNING, %s, line %d\n",  __func__, __LINE__);
 91     }

 92    
 93   return OK;
 94 }


  Timer's sigev_notify = SIGEV_SIGNAL.

 Step 1:  Priority 100 Task A:     wd_start() -> ... -> timer_timeout() -> crefs++(2) -> timer_signotify() -> signal NOTIFY-work to TASK B -> up_unblock_task(TASK B)       pink line timer_timeout printf out
 Step 2:  Priority 120 Task B:     prepare to do NOTIFY work, but interrupt by Task C.
 Step 3:  Priority 200 Task C:     timer_delete() return OK, crefs-- (1), but actually NOT delete timer, haven't free the real timer              pink line timer_delete printf out
 Step 4:  Priority 120 Task B:     do NOTIFY work, user callback called, user see timer ringring.
 Step 5:  Priority 100 Task A:     up_unblock_task() return -> timer_signotify() return -> timer_release() called crefs-- (0).

Step 3 is my guessing step according to  pink line timer_delete printf out.
Cause this problem is different to reproduce. So I need to add more log and wait for reproduce.


But this resolution don't respect your original design: timer_settime() must be called in IRQ handler.
If we can obey the rules: "timer_settime() must be called in IRQ handler.", then this problem will not happen
Sorry, my writing fault. should be:  timer_timeout()

Gregory Nutt

unread,
Aug 29, 2019, 9:21:15 AM8/29/19
to nu...@googlegroups.com
Hi, Guiding,


 Step 1:  Priority 100 Task A:     wd_start() -> ... -> timer_timeout() -> crefs++(2) -> timer_signotify() -> signal NOTIFY-work to TASK B -> up_unblock_task(TASK B)       pink line timer_timeout printf out
 Step 2:  Priority 120 Task B:     prepare to do NOTIFY work, but interrupt by Task C.
 Step 3:  Priority 200 Task C:     timer_delete() return OK, crefs-- (1), but actually NOT delete timer, haven't free the real timer              pink line timer_delete printf out

That should not be a problem.  The caller of timer_delete() does not care if the memory is deleted or not, only that the timer is fully disabled.  Note that timer_delete() calls timer_event() which disables the wdog timer:

137   /* Free the underlying watchdog instance (the timer will be canceled by the
138    * watchdog logic before it is actually deleted)
139    */
140
141   (void)wd_delete(timer->pt_wdog);

So if the timer is deleted BEFORE the notification, then timer_timeout() will not run because the timeout has been canceled.  If timer_timeout () is blocked in the notification when the timer is deleted, it will catch this when it calls timer_release() after resuming.

 Step 4:  Priority 120 Task B:     do NOTIFY work, user callback called, user see timer ringring.
By "ringing", you mean that another event signal is pending?  How could that happen?  That could not happen in this scenario.  So this part does not make any sense to me.   I think you still do not have the root cause of the problem.

 Step 5:  Priority 100 Task A:     up_unblock_task() return -> timer_signotify() return -> timer_release() called crefs-- (0).

And this should prevent timer_timeout() from restarting the timer.  Everything looks good to me.

I don't think that that the above scenario could ever happen.  We would need more data to understand the root cause of the problem.  We should not discuss solutions until the root cause of the problem is understood.  Just changing code without a full understanding would just be putting a bandage on the software.

Greg


Gregory Nutt

unread,
Aug 29, 2019, 9:59:52 AM8/29/19
to nu...@googlegroups.com
Wait a minute.. we have both been showing the wrong sequence!!! 
timer_timeout() is a wdog callback so it runs on the context of timer
interrupt handler.  It does not run in a Task contexts.  There can be no
context switches in timer_timeout() until we return from the timer
interrupt handler!

The corrected sequence would be as follows:

Step 1:  Task A calls timer_settime() which startd timer->pt_wdog.

Step 2:  A timer interrupt occurs. timer->pt_wdog expires and
timer_timeout() is called

Step 3: timer_timeout() increments the reference count to 2

Step 4: timer_timeout() calls timer_signotify() which unblocks task B. 
Task B cannot run, however, until we return from the timer interrupt

Step 5: timer_timeout() calls timer_release().  Reference count goes
back to 1.  Timer is not deleted.  The next periodic timer is setup.

Step 6: Timer interrupt returns.  Then Task B runs.

Step 7: Task B deletes the timer. Reference count goes to zero. 
timer_release() cancels the pending timer,  de-allocates all timer memory

That is all good.  But now that we have the corrected sequence, I can
see is situation where  you could get extra timer expiration
notifications.  Replace steps 6-7 with:

Step 6:  Timer interrupt returns.  Task B is ready to run, but cannot
run because Task C is running and has higher prirotity

Step 7: Task C runs for a long time.  The timer expires and
timer_timeout() runs again before Task C blocks.  An additional event
notification would be queued.

Step 8: Task C blocks and Task B runs.  It deletes the timer.
timer_release() cancels the pending timer,  de-allocates all timer memory.

But there is already a pending event from Step 7.  In that case, Task B
will receive one or more additional notifications AFTER it deletes the
timer.

But this is not an OS problem.  This problem occurs because the system
is too busy, the periodic timer rate is to high and the code cannot
delete the timer quickly enough to prevent one or more additional
notifications from being queued.  This is not an error, this is, I think
normal behavior.

Greg



Guiding Li

unread,
Aug 29, 2019, 11:02:34 PM8/29/19
to NuttX
From the topic beginning, I was showing to you:

There was one opportunity to execute timer_timeout() in THREAD !!!
There was one opportunity to execute timer_timeout() in THREAD !!!
There was one opportunity to execute timer_timeout() in THREAD !!!
timer_timeout() does run in a Task contexts !!!
timer_timeout() does run in a Task contexts !!!
timer_timeout() does run in a Task contexts !!!

This the rote:
THEAD -> wd_start() -> sched_timer_cancel() -> nxsched_timer_process() -> wd_timer() -> wd_expiration() -> timer_timeout()

This can easily reproduce.
Later I will show you the test code which can reproduce this phenomenon.


在 2019年8月29日星期四 UTC+8下午9:59:52,patacongo写道:

spudaneco

unread,
Aug 29, 2019, 11:19:15 PM8/29/19
to nu...@googlegroups.com
Yes you are right about those cases.  But those should be correct too because the reference counting should prevent any bad behavior.

Both cases have been analyzed, but so far I see no problem.

A few additional notifications after disableingthe timer is a possibility in normal operation.



Sent from Samsung tablet.

-------- Original message --------
From: Guiding Li <lig...@gmail.com>
Date: 8/29/19 9:02 PM (GMT-06:00)
Subject: Re: [nuttx] sched/timer: wdog timer already deleted by timer_delete() successful but still ringing

--
You received this message because you are subscribed to the Google Groups "NuttX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nuttx+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nuttx/fe39b999-6727-4afb-948d-77ee9e8dc618%40googlegroups.com.

Guiding Li

unread,
Aug 29, 2019, 11:46:04 PM8/29/19
to NuttX

All my confuse about timer is just:  additional notifications after calling the timer_delete().

If you think:
A few additional notifications after disableing the timer is a possibility in normal operation.
Then that's OK. ^_^

But however, there is a risk on "timer_timeout() does run in a Task contexts".
In one big & complex system, this maybe caused some very bad behavior.
So we need find a way to make the "timer_timeout() must be called in IRQ hanlder".

在 2019年8月30日星期五 UTC+8上午11:19:15,patacongo写道:
To unsubscribe from this group and stop receiving emails from it, send an email to nu...@googlegroups.com.

Guiding Li

unread,
Aug 30, 2019, 12:13:18 AM8/30/19
to NuttX

The attachment fishdemo.c can easily reproduce "timer_timeout() does run in a Task contexts"
You can have a try.

CONFIG_SCHED_TICKLESS should open.
fishdemo.c

patacongo

unread,
Aug 30, 2019, 12:13:55 AM8/30/19
to NuttX

All my confuse about timer is just:  additional notifications after calling the timer_delete().

If is a confusing subject to discuss.  And having different native languages makes it even harder. ;) 

If you think:
A few additional notifications after disableing the timer is a possibility in normal operation.
Then that's OK. ^_^

If is normal if the task that disables the periodic timer is delayed for a long time because of activity of higher priority tasks.  If that task is delayed longer than the timer period, then the timer will expire and one (or maybe more) notifications will be queued.  Thoses queued notifications may then be received AFTER the timer is disabled.

That would be a normal consequence of delaying the task that deletes the timer.

But however, there is a risk on "timer_timeout() does run in a Task contexts".
In one big & complex system, this maybe caused some very bad behavior.
So we need find a way to make the "timer_timeout() must be called in IRQ hanlder".

I don't think that is necessarily true.  I am not aware of any problems in the design.  There is no requirement that timer timeout run in a task context, provided that it runs within a critical section so that interrupts are disabled just like in an interrupt handler.  Why do you think that must happen?

Greg

spudaneco

unread,
Aug 30, 2019, 12:34:07 AM8/30/19
to nu...@googlegroups.com




Sent from Samsung tablet.



But however, there is a risk on "timer_timeout() does run in a Task contexts".
In one big & complex system, this maybe caused some very bad behavior.
So we need find a way to make the "timer_timeout() must be called in IRQ hanlder".

I don't think that is necessarily true.  I am not aware of any problems in the design.  There is no requirement that timer timeout run in a task context, provided that it runs within a critical section so that interrupts are disabled just like in an interrupt handler.  Why do you think that must happen?

Is your concern that timer_timeout() could be suspended, then re-entered with a real timer interrupt?  I don't think that could happen in tickless mode.  Has the interval timer been restarted?  I would have check  but I don't think so.

It certainly could not be re-entered for this same timer because it has not yet been restarted at the time of the possible suspension.

Greg

--
You received this message because you are subscribed to the Google Groups "NuttX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nuttx+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nuttx/4fca057c-9370-45d7-8a1a-7fe0c40e1c7a%40googlegroups.com.

Guiding Li

unread,
Aug 30, 2019, 1:00:42 AM8/30/19
to NuttX

I don't think that is necessarily true.  I am not aware of any problems in the design.  There is no requirement that timer timeout run in a task context, provided that it runs within a critical section so that interrupts are disabled just like in an interrupt handler.  Why do you think that must happen?

that it runs within a critical section so that interrupts are disabled just like in an interrupt handler
"Thead run timer_timeout() within a critical section"   is different with   "timer_timeout() in an interrupt handler".

"Thead run timer_timeout() within a critical section" ----- will switch to new thread initiative.
"timer_timeout() in an interrupt handler"  ---- will not switch to new thread until interrupt handler done.


Is your concern that timer_timeout() could be suspended, then re-entered with a real timer interrupt?  I don't think that could happen in tickless mode.  Has the interval timer been restarted?  I would have check  but I don't think so.

It certainly could not be re-entered for this same timer because it has not yet been restarted at the time of the possible suspension.

NO, this not my concern.

My concern is:
sched/wdog/wd_start.c --- wd_expiration() can be called in TASK context, it can switch to new TASK context while wdog expiring. After switching to new TASK, maybe we are not in a critical section, then can interrupt by real interrupt.


在 2019年8月30日星期五 UTC+8下午12:34:07,patacongo写道:




Sent from Samsung tablet.



But however, there is a risk on "timer_timeout() does run in a Task contexts".
In one big & complex system, this maybe caused some very bad behavior.
So we need find a way to make the "timer_timeout() must be called in IRQ hanlder".

I don't think that is necessarily true.  I am not aware of any problems in the design.  There is no requirement that timer timeout run in a task context, provided that it runs within a critical section so that interrupts are disabled just like in an interrupt handler.  Why do you think that must happen?

Is your concern that timer_timeout() could be suspended, then re-entered with a real timer interrupt?  I don't think that could happen in tickless mode.  Has the interval timer been restarted?  I would have check  but I don't think so.

It certainly could not be re-entered for this same timer because it has not yet been restarted at the time of the possible suspension.

Greg

--
You received this message because you are subscribed to the Google Groups "NuttX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nu...@googlegroups.com.

spudaneco

unread,
Aug 30, 2019, 1:15:21 AM8/30/19
to nu...@googlegroups.com




Sent from Samsung tablet.


My concern is:
sched/wdog/wd_start.c --- wd_expiration() can be called in TASK context, it can switch to new TASK context while wdog expiring. After switching to new TASK, maybe we are not in a critical section, then can interrupt by real interrupt.

Is there some reason to believe that is a problem?

Guiding Li

unread,
Aug 30, 2019, 12:24:40 PM8/30/19
to NuttX
No, have no strong enough proof to support this risk. Currently.
I will report it you if I meet likely problem.
Just keep in mind now.

在 2019年8月30日星期五 UTC+8下午1:15:21,patacongo写道:

patacongo

unread,
Aug 30, 2019, 1:07:44 PM8/30/19
to NuttX



No, have no strong enough proof to support this risk. Currently.
I will report it you if I meet likely problem.
Just keep in mind now.

Okay.  Let me know.  I really do want to be helpful.  I do not want any bugs in the OS.

I do argue a lot, the purpose is to get to the bottom of the issues.

Guiding Li

unread,
Aug 30, 2019, 11:54:16 PM8/30/19
to NuttX
I meet likely error before:

This is IDLE thread.

while (1)
  {
    enter_critical_section();

    pm_changestate()-> pm_notify(PM_NORMAL/../PM_SLEEP);   ==> I2C suspend clk

    WFI;

    pm_changestate()-> pm_notify(PM_RESTORE);                        ==> I2C restore clk

    leave_critical_section();
  }

In my system, as common method, if system IDLE then execute IDLE thread and do WFI.

This is the correctly system modle we hope:

Before WFI, we use pm_notify(PM_NORMAL/../PM_SLEEP) to notify lower-drivers that we are 
going to a low power state. Some lower-drivers may suspend their clks to save power.

After WFI, we use pm_notify(PM_RESTORE) to notify lower-drivers that we are exiting low 
power state. Some lower-drivers can resotre their clks to normal.


But now coming the problem.

while (1)
  {
    enter_critical_section();

    pm_changestate()-> pm_notify(PM_NORMAL/../PM_SLEEP);   ==> I2C suspend clk
    pm_timer() -> wd_start();

    WFI;

    pm_changestate()-> pm_notify(PM_RESTORE);              ==> I2C restore clk

    leave_critical_section();
  }

wd_start() switch out to other thread, and this thread may access I2C and meet fatal error, because I2C clk suspend this time.
pm_timer() -> wd_start() -> sched_timer_cancel() -> nxsched_timer_process() -> wd_timer() -> wd_expiration() -> wdog->func() -> nxsig_timeout() -> up_unblock_task()


This error have the common point with wd_expiration() called in TASK context:
enter_critical_section()
// switch out by wd_start() -> up_unblock_task() acvtively.
leave_critical_section()

Guiding

patacongo

unread,
Aug 31, 2019, 11:24:37 AM8/31/19
to NuttX

Another warning:  sched_lock() would not be the correct interface to use if you are in an interrupt handler OR even if you are not in an interrupt handler but you want to establish a critical section.  The function of sched_lock() is to prevent your thread from being suspended so that it will keep running until sched_unlock().

If there is only a single CPU, then sched_lock() does behave kind of like a critical section.  But in SMP mode with multiple CPUs it does not work that way:  The thread will still keep running while the scheduler is locked, however, it will not effect tasks running on the other CPUs.


Actually, in this case, sched_lock() might work in the SMP case.  sched_lock() does not prevent interference from tasks running on other CPUs, will it prevent new tasks from being unblocked.  So perhaps it would work as you like.

There is another SMP issues:  The timer interrupt executes on only of the CPUs.  So if the timer interrupt occurs on a different CPU, the interrupt will still be taken.  enter_critical_section() can only disable interrupts on the local CPU, not on the other CPUs.  However, if enter_critical_section() is called in in the wdog expiration patch, then that will stop the timer interrupt handler on a spin lock and still prevent any re-entrancy.  That would be work studying more.

Guiding Li

unread,
Sep 2, 2019, 2:27:49 AM9/2/19
to NuttX
These two scenario cases common point:
enter_critical_section()
// switch out by wd_start() -> wd_expiration() -> up_unblock_task() acvtively.
leave_critical_section()

How about we cut off the route:
// wd_start() -> wd_expiration()           // wd_start() have no chance to call wd_expiration()

Is this resolved way suitable for our system ?

Gregory Nutt

unread,
Sep 2, 2019, 9:24:07 AM9/2/19
to nu...@googlegroups.com

I believe that that is incorrect.  When you start a new timer, there are two possibilities:

1. The current interval delay is shorter than the new timer delay.  In this case, nothing really has to be done.  The next timer expiration does not change.  What you suggest above works in that case.

2. The current interval delay is longer than the new timer delay.  In this case, we must a) cancel the current interval timer, b) update the time for the elapsed time on the canceled timer, then c) restart the interval timer with shorter delay.  That is what wd_expiration() does that and that is why it must be called.

Normally, calling wd_expiration() has not unusual side effects.  There is only one case:  If the interval timer has already expired.  Since interrupts are disabled in the timer code, that interrupt will pend.  In this case, calling wd_expiration() can cause a timer to expire in 2b.  But this is all good and normal.  I would not change it.

Greg



Reply all
Reply to author
Forward
0 new messages