Ntpd in uninterruptible sleep?

76 views
Skip to first unread message

A C

unread,
Oct 23, 2011, 3:28:08 PM10/23/11
to
A continuation of my debugging ntpd on a sparc system running NetBSD.
First, using -N results in a system lockup after a period of time if I
am polling ntpd remotely with ntpq. I removed the -N option from
startup and the system stayed stable even when querying at 5 second
intervals.

However, I just ran into a new problem. Here's the output of ps ax for
the ntpd entry:

22887 ? Ds 20:17.58 /usr/sbin/ntpd


Uninterruptible sleep? The daemon is indeed asleep because it stopped
responding to my ntpq queries but this time the system stayed up. I
suspect it was this same sleep that occurred in the other condition but
the high priority locked up the machine. I was able to kill and restart
ntpd without having to fully reboot. All other processes were still
running normally but slowly (see below).

More interesting is that the cpu was pegged until I was able to kill and
restart ntpd. Most of the cpu was devoted to ntpd during this locked up
period. Simple things like typing at the console were difficult. It
would take a few seconds for a keypress to register on the screen. Once
ntpd was restarted the system responded normally and the cpu usage
dropped to normal levels.

This is still version 4.2.6p3. I should probably go ahead and compile
the most recently released version but I'm at a loss to understand why
it happened.

Harlan Stenn

unread,
Oct 23, 2011, 4:02:06 PM10/23/11
to
GDB (or something similar) may be your friend.

David Woolley

unread,
Oct 23, 2011, 6:08:17 PM10/23/11
to
A C wrote:
>
> However, I just ran into a new problem. Here's the output of ps ax for
> the ntpd entry:
>
> 22887 ? Ds 20:17.58 /usr/sbin/ntpd
>
>
> Uninterruptible sleep? The daemon is indeed asleep because it stopped
> responding to my ntpq queries but this time the system stayed up. I

An uninterruptible sleep state isn't something that you can explicitly
enter from user space. This suggest a kernel or device driver bug.

unruh

unread,
Oct 23, 2011, 7:06:28 PM10/23/11
to
Is there some problem with your ethernet card? Is ntpd polling and
unresponsive card like mad? Ie, is hardware a problem?

A C

unread,
Oct 23, 2011, 7:22:36 PM10/23/11
to
I'll talk to the NetBSD list and see what they have to say but it's very
odd considering everything else on the system was still working
including the serial ports. I have gpsd running and it was still
receiving data from the attached GPS receiver and responding to remote
network requests from a status summary page on another machine.

A C

unread,
Oct 23, 2011, 7:19:30 PM10/23/11
to
The network interface (AUI with a 10BT transceiver, Lance Ethernet) was
up and running the entire time. I had xclock running on the system with
the display exported to another machine and I also had an ssh session in
an xterm up and running. Both of those continued to function (albeit
slowly) while ntpd was stuck in that sleep state. I actually used the
ssh-in-xterm session to kill and restart ntpd. There was no significant
polling on the interface (activity LEDs showed normal behavior).

Uwe Klein

unread,
Oct 23, 2011, 5:43:51 PM10/23/11
to
A C wrote:
> More interesting is that the cpu was pegged until I was able to kill and
> restart ntpd. Most of the cpu was devoted to ntpd during this locked up
> period. Simple things like typing at the console were difficult. It
> would take a few seconds for a keypress to register on the screen. Once
> ntpd was restarted the system responded normally and the cpu usage
> dropped to normal levels.
>
> This is still version 4.2.6p3. I should probably go ahead and compile
> the most recently released version but I'm at a loss to understand why
> it happened.

CPU (over)loaded
or the system is swapping like mad ?
( I'd think it is swapping? )

uwe

A C

unread,
Oct 24, 2011, 1:31:31 AM10/24/11
to
Mostly CPU overload. I can hear the hard drive heads on this particular
machine and it did indeed need to swap quite a bit but the CPU was most
certainly pegged out at the same time. If I didn't touch the machine,
the drives were silent and only the CPU was churning.

unruh

unread,
Oct 24, 2011, 9:24:43 AM10/24/11
to
That would destroy everything ( ie slow everything to a crawl). Also
ntpd is a small program and especially at heightened niceness (which he
said he used) should not get swapped out or affected.

A C

unread,
Oct 24, 2011, 10:52:16 AM10/24/11
to
Right, the swapping would only occur if I was trying to actively do
something while diagnosing the problem. Otherwise the system load is so
low there's no real need to swap. There is no desktop environment
running on the system, only the standard daemons plus some extras (sshd,
ntpd, gpsd, crond, syslogd, inetd, postfix for system messages only) and
then one xterm (with ssh session inside) and one xclock. The cron jobs
are mostly system housekeeping (log rotation, etc.) that occur at 0:00
but the crashes would occur at other times so none of the cron jobs were
running at the time of any crash.

I've got extra RAM sticks that I pulled out during debugging in case I
had a bad memory cell but that turns out not to be the case so I'll need
to add all the sticks back in. Still the load isn't quite large enough
to swap everything out. Right now the machine is quiet and not swapping
anything unless I run something resource intensive.

The header from top when things work normally (ntpd no longer running at
high priority in this capture):

> load averages: 0.10, 0.09, 0.02; up 14+20:55:09 14:44:43
> 22 processes: 21 sleeping, 1 on CPU
> CPU states: 10.5% user, 0.0% nice, 7.4% system, 6.4% interrupt, 75.8% idle
> Memory: 5048K Act, 2432K Inact, 300K Wired, 2316K Exec, 1336K File, 200K Free
> Swap: 128M Total, 9272K Used, 119M Free
>
> PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
> 5738 root 85 0 5416K 1004K pause 19:25 0.24% 0.24% ntpd
> 0 root 125 0 0K 5608K cachegc 140:35 0.05% 0.05% [system]
> 410 agcarver 85 0 7328K 1212K select 27:54 0.05% 0.05% xterm
> 323 nobody 90 -10 13M 784K select 37.7H 0.00% 0.00% gpsd
> 533 agcarver 85 0 11M 728K select 106:58 0.00% 0.00% sshd
> 415 agcarver 85 0 7172K 676K select 94:31 0.00% 0.00% xclock

unruh

unread,
Oct 24, 2011, 2:48:13 PM10/24/11
to
On 2011-10-24, A C <agcarv...@acarver.net> wrote:
> On 10/24/2011 06:24, unruh wrote:
>> On 2011-10-23, Uwe Klein<u...@klein-habertwedt.de> wrote:
>>> A C wrote:
>>>> More interesting is that the cpu was pegged until I was able to kill and
>>>> restart ntpd. Most of the cpu was devoted to ntpd during this locked up
>>>> period. Simple things like typing at the console were difficult. It
>>>> would take a few seconds for a keypress to register on the screen. Once
>>>> ntpd was restarted the system responded normally and the cpu usage
>>>> dropped to normal levels.
>>>>
>>>> This is still version 4.2.6p3. I should probably go ahead and compile
>>>> the most recently released version but I'm at a loss to understand why
>>>> it happened.
>>>
>>> CPU (over)loaded
>>> or the system is swapping like mad ?
>>
>> That would destroy everything ( ie slow everything to a crawl). Also
>> ntpd is a small program and especially at heightened niceness (which he
>> said he used) should not get swapped out or affected.
>>
>>> ( I'd think it is swapping? )
>>>
>
> Right, the swapping would only occur if I was trying to actively do
> something while diagnosing the problem. Otherwise the system load is so
> low there's no real need to swap. There is no desktop environment

? swapping occurs if system memory fills up completely and there is no
more memory. It does not just refer to any disk access. Run top and see
if there is a substantial swap useage. If there issomething is very
wrong.

> running on the system, only the standard daemons plus some extras (sshd,
> ntpd, gpsd, crond, syslogd, inetd, postfix for system messages only) and
> then one xterm (with ssh session inside) and one xclock. The cron jobs
> are mostly system housekeeping (log rotation, etc.) that occur at 0:00
> but the crashes would occur at other times so none of the cron jobs were
> running at the time of any crash.

NOne of those would cause swapping unless you only have 1MB of memory.

Chuck Swiger

unread,
Oct 24, 2011, 2:40:52 PM10/24/11
to
On Oct 24, 2011, at 7:52 AM, A C wrote:
> The header from top when things work normally (ntpd no longer running at high priority in this capture):
>
>> load averages: 0.10, 0.09, 0.02; up 14+20:55:09 14:44:43
>> 22 processes: 21 sleeping, 1 on CPU
>> CPU states: 10.5% user, 0.0% nice, 7.4% system, 6.4% interrupt, 75.8% idle
>> Memory: 5048K Act, 2432K Inact, 300K Wired, 2316K Exec, 1336K File, 200K Free
>> Swap: 128M Total, 9272K Used, 119M Free

This box has ~8MB of RAM?

Well, that was a reasonable amount for a ~1990-era machine, and NetBSD will still run on it today, but running X11 clients is sure to push it into swapping out other processes entirely. That will cause ntpd (and gpsd, etc) to stop responding entirely for many seconds while the hard drive chugs-- that's the "uninterruptible sleep" you initially asked about.

Put more RAM into the box, or replace it with something from this or at least the previous decade. :-)

Regards,
--
-Chuck

A C

unread,
Oct 24, 2011, 4:54:49 PM10/24/11
to
I was making the statement that swapping occured only when I was
actively attempting to debug the situation. Leaving the machine alone
did not result in sustained swapping.

>
>> running on the system, only the standard daemons plus some extras (sshd,
>> ntpd, gpsd, crond, syslogd, inetd, postfix for system messages only) and
>> then one xterm (with ssh session inside) and one xclock. The cron jobs
>> are mostly system housekeeping (log rotation, etc.) that occur at 0:00
>> but the crashes would occur at other times so none of the cron jobs were
>> running at the time of any crash.
>
> NOne of those would cause swapping unless you only have 1MB of memory.

I understand that, I was only documenting them for completeness.

A C

unread,
Oct 24, 2011, 4:58:47 PM10/24/11
to
It has 16 MB of RAM right now (one stick). It had 64 at one point while
I was trying to diagnose this issue. I thought I might have a bad stick
of RAM so I was testing them one by one (one stick in the machine at a
time). I didn't reinsert the other three yet. But it doesn't matter,
even with 64 MB ntpd still crashes which is what started this whole
debugging game in the first place.

Besides, I wouldn't call an Xterm and Xclock anything significant in the
way of X11 clients. I'm certainly not intending to run anything beyond
those. The Xterm is for a console and the Xclock I was using to keep an
eye on the machine so I'd know if the system locked up.

Chuck Swiger

unread,
Oct 24, 2011, 5:11:31 PM10/24/11
to
On Oct 24, 2011, at 1:58 PM, A C wrote:
> It has 16 MB of RAM right now (one stick). It had 64 at one point while I was trying to diagnose this issue. I thought I might have a bad stick of RAM so I was testing them one by one (one stick in the machine at a time). I didn't reinsert the other three yet. But it doesn't matter, even with 64 MB ntpd still crashes which is what started this whole debugging game in the first place.

If ntpd crashes, you should get a coredump which you can debug (assuming you've setup the coredumpsize limit to permit this) and perhaps a syslog message about a SIGSEGV, SIGBUS, or whatever.

> Besides, I wouldn't call an Xterm and Xclock anything significant in the way of X11 clients. I'm certainly not intending to run anything beyond those. The Xterm is for a console and the Xclock I was using to keep an eye on the machine so I'd know if the system locked up.

If you had 64MB of RAM, then the two clients wouldn't or shouldn't cause your system to swap. With only 16MB, however, you're sure to be paging, and more likely to be swapping entire processes out to disk

Regards,
--
-Chuck

A C

unread,
Oct 24, 2011, 5:26:52 PM10/24/11
to
On 10/24/2011 14:11, Chuck Swiger wrote:
> On Oct 24, 2011, at 1:58 PM, A C wrote:
>> It has 16 MB of RAM right now (one stick). It had 64 at one point while I was trying to diagnose this issue. I thought I might have a bad stick of RAM so I was testing them one by one (one stick in the machine at a time). I didn't reinsert the other three yet. But it doesn't matter, even with 64 MB ntpd still crashes which is what started this whole debugging game in the first place.
>
> If ntpd crashes, you should get a coredump which you can debug (assuming you've setup the coredumpsize limit to permit this) and perhaps a syslog message about a SIGSEGV, SIGBUS, or whatever.

Not if it locks the system up entirely (as in I couldn't even break out
of the kernel) which is what had been happening. Turning off the
priority on ntpd (eliminating the use of the -N flag) seems to avoid the
system lock up issue but doesn't eliminate the ntpd lockup. The process
just sat there yesterday not doing anything and bogging the machine
down. The process itself doesn't die, there's no core to dump (yet), it
just sits and sits and sits...

>
>> Besides, I wouldn't call an Xterm and Xclock anything significant in the way of X11 clients. I'm certainly not intending to run anything beyond those. The Xterm is for a console and the Xclock I was using to keep an eye on the machine so I'd know if the system locked up.
>
> If you had 64MB of RAM, then the two clients wouldn't or shouldn't cause your system to swap. With only 16MB, however, you're sure to be paging, and more likely to be swapping entire processes out to disk

Again, the amount of memory is a distraction from the issue because even
with the 64 MB ntpd was having problems. But, I'll install the other
three sticks this week when I have an opportunity which will mostly
eliminate swapping issues.

Chuck Swiger

unread,
Oct 24, 2011, 5:33:07 PM10/24/11
to
On Oct 24, 2011, at 2:26 PM, A C wrote:
>> If ntpd crashes, you should get a coredump which you can debug (assuming you've setup the coredumpsize limit to permit this) and perhaps a syslog message about a SIGSEGV, SIGBUS, or whatever.
>
> Not if it locks the system up entirely (as in I couldn't even break out of the kernel) which is what had been happening. Turning off the priority on ntpd (eliminating the use of the -N flag) seems to avoid the system lock up issue but doesn't eliminate the ntpd lockup.

That sounds somewhat like an OS bug, where running at very high priority might livelock the system and prevent it from servicing network traffic or interrupts. You haven't mentioned which version of NetBSD you are running; it's possible that it's a known issue which has since been fixed.

Or it could just be something else in the H/W getting flakey...

> The process just sat there yesterday not doing anything and bogging the machine down. The process itself doesn't die, there's no core to dump (yet), it just sits and sits and sits...

This case should be debuggable-- you can to use gdb to attach to the process and see where it is, or fire off ptrace against it.

Regards,
--
-Chuck

A C

unread,
Oct 24, 2011, 5:38:05 PM10/24/11
to
On 10/24/2011 14:33, Chuck Swiger wrote:
> On Oct 24, 2011, at 2:26 PM, A C wrote:
>>> If ntpd crashes, you should get a coredump which you can debug (assuming you've setup the coredumpsize limit to permit this) and perhaps a syslog message about a SIGSEGV, SIGBUS, or whatever.
>>
>> Not if it locks the system up entirely (as in I couldn't even break out of the kernel) which is what had been happening. Turning off the priority on ntpd (eliminating the use of the -N flag) seems to avoid the system lock up issue but doesn't eliminate the ntpd lockup.
>
> That sounds somewhat like an OS bug, where running at very high priority might livelock the system and prevent it from servicing network traffic or interrupts. You haven't mentioned which version of NetBSD you are running; it's possible that it's a known issue which has since been fixed.
>
> Or it could just be something else in the H/W getting flakey...

NetBSD 5.1 (most recent) but the hardware runs fine. In fact without
the high priority everything else on the system worked. I had gpsd
running in the background reading the GPS and servicing data requests
even though ntpd was stuck.

>
>> The process just sat there yesterday not doing anything and bogging the machine down. The process itself doesn't die, there's no core to dump (yet), it just sits and sits and sits...
>
> This case should be debuggable-- you can to use gdb to attach to the process and see where it is, or fire off ptrace against it.

I'll try it again when it happens. It was hard to do last time because
the system was completely unresponsive. This time, now that I know
things will still function, I might get some more useful data.

Richard B. Gilbert

unread,
Oct 24, 2011, 7:04:50 PM10/24/11
to
64MB is not a lot of RAM by today's standards. Can you install more for
testing purposes?


A C

unread,
Oct 24, 2011, 8:16:00 PM10/24/11
to
On 10/24/2011 16:04, Richard B. Gilbert wrote:

>
> 64MB is not a lot of RAM by today's standards. Can you install more for
> testing purposes?

Unfortunately not. There are only four slots and I don't have any
sticks larger than 16MB so 64 is it for now. However, for the little
that the machine does (it's only job is ntpd and gpsd) I think 64 should
be adequate.

Uwe Klein

unread,
Oct 25, 2011, 5:06:01 AM10/25/11
to
Heheh, still got my MVME167 with 8MB of RAM running SysVR3.6.

Are you using the local machine as X-Display?

Can you go down to a nongraphical runstate and
work via a remote (telnet) session from another box?

If nothing else works kill the wedged process hard
to create a core dump?

uwe

A C

unread,
Oct 25, 2011, 10:30:01 AM10/25/11
to
I am not using local graphics. There is no desktop environment in use
on this machine at all. In fact there is no monitor on the machine (at
least not full time). My Xterm and Xclock are exported to a remote machine.

I need to wait for the process to stall again. It will take a couple
days for that to happen and then I'll try a kill-9 after trying to run a
stack trace. A SIGTERM kill managed to stop the ntpd process this time
but it never generated core.

Uwe Klein

unread,
Oct 25, 2011, 12:22:31 PM10/25/11
to
A C wrote:
> I am not using local graphics. There is no desktop environment in use
> on this machine at all. In fact there is no monitor on the machine (at
> least not full time). My Xterm and Xclock are exported to a remote
> machine.
OK
>
> I need to wait for the process to stall again. It will take a couple
> days for that to happen and then I'll try a kill-9 after trying to run a
> stack trace. A SIGTERM kill managed to stop the ntpd process this time
> but it never generated core.

SIGTERM usually doesn't create a core dump ( SIGKILL neither ;-).
( think of the hassle with all the core files when switching runlevels ;-)

you may have to adjust max coresize via ?ulimit?

uwe

A C

unread,
Oct 25, 2011, 4:37:04 PM10/25/11
to
On 10/25/2011 09:22, Uwe Klein wrote:

>> I need to wait for the process to stall again. It will take a couple
>> days for that to happen and then I'll try a kill-9 after trying to run
>> a stack trace. A SIGTERM kill managed to stop the ntpd process this
>> time but it never generated core.
>
> SIGTERM usually doesn't create a core dump ( SIGKILL neither ;-).
> ( think of the hassle with all the core files when switching runlevels ;-)

Well right, I'm just pointing out that no core was created this time. I
have unlimited core size configured right now so I will get a core file
during the next kill.

unruh

unread,
Oct 25, 2011, 5:53:08 PM10/25/11
to
On 2011-10-25, A C <agcarv...@acarver.net> wrote:
> On 10/25/2011 02:06, Uwe Klein wrote:
>> A C wrote:
>>> On 10/24/2011 16:04, Richard B. Gilbert wrote:
>>>
>>>>
>>>> 64MB is not a lot of RAM by today's standards. Can you install more for
>>>> testing purposes?
>>>
>>>
>>> Unfortunately not. There are only four slots and I don't have any
>>> sticks larger than 16MB so 64 is it for now. However, for the little
>>> that the machine does (it's only job is ntpd and gpsd) I think 64
>>> should be adequate.
>> Heheh, still got my MVME167 with 8MB of RAM running SysVR3.6.
>>
>> Are you using the local machine as X-Display?
>>
>> Can you go down to a nongraphical runstate and
>> work via a remote (telnet) session from another box?
>>
>> If nothing else works kill the wedged process hard
>> to create a core dump?
>
> I am not using local graphics. There is no desktop environment in use
> on this machine at all. In fact there is no monitor on the machine (at
> least not full time). My Xterm and Xclock are exported to a remote machine.

?? So why are you using X at all? Xterm can surely be replaced by the
console. And xclock by running date to tell you what time the machine
thinks it is (I assume you use that to check to see if ntpd has stopped
controlling the clock)

John Hasler

unread,
Oct 25, 2011, 6:30:21 PM10/25/11
to
unruh writes:
> So why are you using X at all?

He isn't running X at all (or at least he doesn't need to be). He's
just running a couple of X clients that are talking to the X server on
his workstation.
--
John Hasler
jha...@newsguy.com
Dancing Horse Hill
Elmwood, WI USA

A C

unread,
Oct 25, 2011, 6:56:23 PM10/25/11
to
On 10/25/2011 14:53, unruh wrote:
> On 2011-10-25, A C<agcarv...@acarver.net> wrote:
>> On 10/25/2011 02:06, Uwe Klein wrote:
>>> A C wrote:
>>>> On 10/24/2011 16:04, Richard B. Gilbert wrote:
>>>>
>>>>>
>>>>> 64MB is not a lot of RAM by today's standards. Can you install more for
>>>>> testing purposes?
>>>>
>>>>
>>>> Unfortunately not. There are only four slots and I don't have any
>>>> sticks larger than 16MB so 64 is it for now. However, for the little
>>>> that the machine does (it's only job is ntpd and gpsd) I think 64
>>>> should be adequate.
>>> Heheh, still got my MVME167 with 8MB of RAM running SysVR3.6.
>>>
>>> Are you using the local machine as X-Display?
>>>
>>> Can you go down to a nongraphical runstate and
>>> work via a remote (telnet) session from another box?
>>>
>>> If nothing else works kill the wedged process hard
>>> to create a core dump?
>>
>> I am not using local graphics. There is no desktop environment in use
>> on this machine at all. In fact there is no monitor on the machine (at
>> least not full time). My Xterm and Xclock are exported to a remote machine.
>
> ?? So why are you using X at all? Xterm can surely be replaced by the
> console. And xclock by running date to tell you what time the machine
> thinks it is (I assume you use that to check to see if ntpd has stopped
> controlling the clock)

A few reasons for not using console:
1. I don't have a monitor to devote to the machine that works at
standard Sun framebuffer refresh rates. With the odd resolution and
rate of the CG6 most inexpensive LCDs just fall over and refuse to sync.
My one monitor that can sync is also the nice one that I use for my
regular desktop machine.

2. Even with a monitor on the machine there's a bug in the console
driver in NetBSD that locks up the console getty if redraws occur too
quickly. It's a known bug on the sparc hardware but hard to track down.
It doesn't affect anything else except the getty on the console.

3. I can leave the Xterm on my screen on my everyday machine and be able
to use it when I need to without needing to reach for another keyboard.
Plus anything I do on the machine, especially during debugging right
now, I can copy and paste to post to the list or save for later.

You are mostly correct about Xclock, it's just there to tell me when the
machine simply dies while I'm debugging these lockup issues. I wouldn't
usually have the clock up and running but right now it comes in handy
for verifying that the network/kernel/base system is still running.
Checking ntpd happens via another machine and ntpq. It's my intent to
have a cron poll once every few hours, make sure all is good, and fire
off an email warning if it's not. If I can get ntpd and this machine to
behave I'll probably never have an email and it can sit quietly keeping
time for my network.

A C

unread,
Oct 25, 2011, 7:07:32 PM10/25/11
to
On 10/25/2011 15:30, John Hasler wrote:
> unruh writes:
>> So why are you using X at all?
>
> He isn't running X at all (or at least he doesn't need to be). He's
> just running a couple of X clients that are talking to the X server on
> his workstation.

Correct, there are no window managers or any other X environment running
on the ntpd system. Just the two clients (xterm, xclock) with their
displays exported to another machine.

> PID TTY STAT TIME COMMAND
> 0 ? DKl 154:17.69 [system]
> 1 ? IWs 0:00.78 init
> 115 ? Ss 1:20.74 /usr/sbin/syslogd -s
> 249 ? IWs 0:00.30 /usr/sbin/sshd
> 323 ? S<s 2464:24.35 gpsd -bnG /dev/gps1
> 335 ? Is 3:31.65 /usr/libexec/postfix/master
> 363 ? IWs 0:05.87 sshd: agcarver [priv]
> 374 ? IWs 0:00.06 /usr/sbin/inetd -l
> 391 ? I 2:35.66 qmgr -l -t unix -u
> 400 ? Is 1:40.46 /usr/sbin/cron
> 408 ? S 2:30.33 sshd: agcarver@notty
> 410 ? S 30:19.76 /usr/X11R6/bin/xterm
> 415 ? S 103:06.24 /usr/X11R6/bin/xclock
> 508 ? IWs 0:06.32 sshd: agcarver [priv]
> 533 ? S 116:39.31 sshd: agcarver@notty
> 5738 ? Ss 52:10.90 /usr/sbin/ntpd
> 27213 ? I 0:01.09 pickup -l -t fifo -u
> 402 ? IWs+ 0:00.31 /usr/libexec/getty suncons console
> 34 ttyp0 IW 0:01.84 -csh
> 421 ttyp0 IWs 0:00.48 bash
> 1047 ttyp0 S 0:21.31 bash
> 25333 ttyp0 O+ 0:00.22 ps -ax

Steve Kostecke

unread,
Oct 25, 2011, 11:46:46 PM10/25/11
to
On 2011-10-25, A C <agcarv...@acarver.net> wrote:
> On 10/25/2011 14:53, unruh wrote:
>>
>> ?? So why are you using X at all? Xterm can surely be replaced by the
>> console. And xclock by running date to tell you what time the machine
>> thinks it is (I assume you use that to check to see if ntpd has stopped
>> controlling the clock)
>
> A few reasons for not using console:
> 1. I don't have a monitor to devote to the machine that works at
> standard Sun framebuffer refresh rates. With the odd resolution and
> rate of the CG6 most inexpensive LCDs just fall over and refuse to sync.
> My one monitor that can sync is also the nice one that I use for my
> regular desktop machine.
>
> 2. Even with a monitor on the machine there's a bug in the console
> driver in NetBSD that locks up the console getty if redraws occur too
> quickly. It's a known bug on the sparc hardware but hard to track down.
> It doesn't affect anything else except the getty on the console.
>
> 3. I can leave the Xterm on my screen on my everyday machine and be able
> to use it when I need to without needing to reach for another keyboard.
> Plus anything I do on the machine, especially during debugging right
> now, I can copy and paste to post to the list or save for later.

Wouldn't it be easier to just open an Xterm on your everyday machine and
ssh into the time server?

--
Steve Kostecke <kost...@ntp.org>
NTP Public Services Project - http://support.ntp.org/

A C

unread,
Oct 26, 2011, 2:41:22 AM10/26/11
to
That's almost what I'm doing. I just happen to have a quick icon on my
regular machine that creates an ssh session and sends back an xterm.
It's really not causing much if any overhead on the Sun machine. Here's
the output from top with top itself running inside the xterm so
refreshes are occuring and causing the xterm and ssh to use a bit of
CPU. There isn't much load at all, mostly the garbage collector. I
haven't had a chance yet to put the memory back in the system so it's
still running with 16 MB.

> load averages: 0.17, 0.06, 0.01; up 16+12:49:20 06:38:54
> 22 processes: 21 sleeping, 1 on CPU
> CPU states: 11.4% user, 0.0% nice, 7.3% system, 4.9% interrupt, 76.3% idle
> Memory: 4824K Act, 2440K Inact, 300K Wired, 2180K Exec, 1412K File, 460K Free
> Swap: 128M Total, 8820K Used, 120M Free
>
> PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
> 0 root 125 0 0K 5392K cachegc 158:13 3.56% 3.56% [system]
> 29574 root 85 0 5416K 896K pause 1:54 1.76% 1.76% ntpd
> 408 agcarver 85 0 11M 752K select 2:40 1.56% 1.56% sshd
> 1047 root 85 0 3448K 496K wait 0:23 0.98% 0.98% bash
> 533 agcarver 85 0 11M 668K select 118:57 0.78% 0.78% sshd
> 323 nobody 90 -10 13M 576K select 41.9H 0.00% 0.00% gpsd
> 415 agcarver 85 0 7172K 504K select 105:06 0.00% 0.00% xclock
> 410 agcarver 85 0 7328K 836K select 31:01 0.00% 0.00% xterm
> 335 root 85 0 5708K 260K kqueue 3:37 0.00% 0.00% master
> 391 postfix 85 0 5720K 308K kqueue 2:39 0.00% 0.00% qmgr
> 400 root 85 0 3372K 244K nanoslp 1:43 0.00% 0.00% cron
> 115 root 85 0 3468K 288K kqueue 1:23 0.00% 0.00% syslogd
> 508 root 83 0 11M 0K netio 0:06 0.00% 0.00% <sshd>
> 363 root 85 0 11M 0K netio 0:05 0.00% 0.00% <sshd>
> 34 root 85 0 3292K 0K pause 0:01 0.00% 0.00% <csh>
> 926 root 43 0 3624K 1112K CPU 0:00 0.00% 0.00% top
> 28716 postfix 85 0 5708K 616K kqueue 0:00 0.00% 0.00% pickup
> 1 root 85 0 3472K 60K wait 0:00 0.00% 0.00% init
> 249 root 85 0 7592K 0K select 0:00 0.00% 0.00% <sshd>
> 421 agcarver 85 0 3448K 0K wait 0:00 0.00% 0.00% <bash>
> 402 root 85 0 3444K 0K ttyraw 0:00 0.00% 0.00% <getty>
> 374 root 83 0 3584K 0K kqueue 0:00 0.00% 0.00% <inetd>

A C

unread,
Oct 27, 2011, 3:58:46 PM10/27/11
to
Ok, so ntpd died again today. This time it's not stuck in
uninterruptible sleep, it's stuck in a loop running. The machine is
still working, I'm able to access it remotely but ntpd is refusing
network requests. Just to make sure there were no serial port problems
I killed gpsd while ntpd was still spinning out of control but it didn't
resolve. In fact gpsd was still working (so serial port communication
was still good) even though ntpd was dead. It took a few days for this
to happen. It was running just fine up until that point.

Here's the output from top:

> load averages: 2.00, 1.54, 1.14; up 18+01:54:28 19:44:02
> 21 processes: 1 runnable, 19 sleeping, 1 on CPU
> CPU states: 92.4% user, 0.0% nice, 3.8% system, 2.9% interrupt, 0.9% idle
> Memory: 3008K Act, 1500K Inact, 3336K Wired, 1232K Exec, 2132K File, 316K Free
> Swap: 128M Total, 10M Used, 118M Free
>
> PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
> 29574 root 29 0 9512K 3460K RUN 64:13 92.92% 92.92% ntpd
> 410 agcarver 85 0 7328K 272K select 33:48 2.69% 2.69% xterm
> 0 root 125 0 0K 5548K cachegc 173:53 0.78% 0.78% [system]
> 533 agcarver 85 0 11M 520K select 130:13 0.00% 0.00% sshd
> 415 agcarver 85 0 7172K 456K select 115:04 0.00% 0.00% xclock
> 335 root 85 0 5708K 344K kqueue 3:58 0.00% 0.00% master
> 391 postfix 85 0 5720K 440K kqueue 2:55 0.00% 0.00% qmgr
> 408 agcarver 85 0 11M 280K select 2:49 0.00% 0.00% sshd
> 400 root 85 0 3372K 244K nanoslp 1:52 0.00% 0.00% cron
> 115 root 85 0 3468K 168K kqueue 1:31 0.00% 0.00% syslogd
> 1047 root 85 0 3448K 80K wait 0:23 0.00% 0.00% <bash>
> 508 root 83 0 11M 0K netio 0:06 0.00% 0.00% <sshd>
> 363 root 85 0 11M 0K netio 0:05 0.00% 0.00% <sshd>
> 3988 root 43 0 3624K 344K CPU 0:02 0.00% 0.00% top
> 28670 postfix 85 0 5708K 252K kqueue 0:01 0.00% 0.00% pickup
> 34 root 85 0 3292K 0K pause 0:01 0.00% 0.00% <csh>
> 1 root 85 0 3472K 12K wait 0:00 0.00% 0.00% init
> 249 root 85 0 7592K 0K select 0:00 0.00% 0.00% <sshd>
> 421 agcarver 85 0 3448K 0K wait 0:00 0.00% 0.00% <bash>
> 402 root 85 0 3444K 0K ttyraw 0:00 0.00% 0.00% <getty>
> 374 root 83 0 3584K 0K kqueue 0:00 0.00% 0.00% <inetd>
>

I attempted a ktrace but I didn't get anything from it. It wasn't able
to attach to the process. I'll have to try again when it crashes next time.

Fran

unread,
Oct 27, 2011, 7:13:43 PM10/27/11
to
Can you send a signal to it that causes a core dump, and then trace the stack to see where it was ?

A C

unread,
Oct 27, 2011, 9:52:23 PM10/27/11
to
On 10/27/2011 16:13, Fran wrote:
> Can you send a signal to it that causes a core dump, and then trace the stack to see where it was ?

I'll do that next time it locks up (should be in a few days). I used a
regular kill to stop the ktrace. I didn't know that ktrace had failed
to grab the process until it was too late and the process was gone.

A C

unread,
Nov 4, 2011, 5:52:19 PM11/4/11
to
Ok, so ntpd does not respond to a SIGQUIT with a core dump but I did
manage to attach a trace to the process before killing it. The output
of ktrace is below. The circumstances of this particular lockup I was
actually able to observe. I camped out at the system waiting.

A cron job fired off which does routine disk maintenance (diffs of
config files, free space calculations, postfix cleanup, etc.) In this
case things got swapped around a bit while all the cleanup was occuring.
However, after those activities finished, ntpd never returned to
normal. It just spun out of control resulting in the trace below. Other
programs that were running at the same time (gpsd, xclock, xterm) all
recovered cleanly though the system was now bogged down by ntpd
consuming almost all the processor time even though it was not set to
high priority. The capture below pretty much loops continuously until
ntpd is finally killed. I actually let the system run for an additional
24 hours in this state just to see if it would bounce back but it never
did. I killed only one process, ntpd, and everything else was fine as
the CPU load dropped to near zero immediately.

1210 1 ntpd CALL clock_gettime(0,0xefffd0e8)
1210 1 ntpd RET clock_gettime 0, -268447512/0xefffd0e8
1210 1 ntpd CALL select(0x1c,0xefffd05c,0,0,0xefffd0b4)
1210 1 ntpd RET select 1, -268447652/0xefffd05c
1210 1 ntpd CALL
recvfrom(0x16,0xefffcc74,0x3e8,0,0xefffd098,0xefffd0ec)
1210 1 ntpd MISC msghdr: 28,
00000000f02cf7e0f25edeac00000001000000000001b58400000000
1210 1 ntpd GIO fd 22 read 12 bytes
"\^V\^A\0\^A\0\0\0\0\0\0\0\0"
1210 1 ntpd MISC sockname: 16, 1002de040a00008d0000000000000000
1210 1 ntpd RET recvfrom 12/0xc, -268448652/0xefffcc74
1210 1 ntpd CALL
recvfrom(0x16,0xefffcc74,0x3e8,0,0xefffd098,0xefffd0ec)
1210 1 ntpd MISC msghdr: 28,
00000000f02cf7e0f25edeac00000001000000000001b58400000000
1210 1 ntpd RET recvfrom -1 errno 35 Resource temporarily
unavailable
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd178)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd178)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd178)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd178)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGIO caught handler=0x6b670 mask=():
code=[1], fd=22, band=41)
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=(23):
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffce00)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd CALL clock_gettime(0,0xefffd018)
1210 1 ntpd RET clock_gettime 0, -268447720/0xefffd018
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=(23):
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffcdf8)
1210 1 ntpd RET setcontext JUSTRETURN
>> 1210 1 ntpd CALL select(0x1c,0xefffcf8c,0,0,0xefffcfe4)
1210 1 ntpd RET select 1, -268447860/0xefffcf8c
1210 1 ntpd CALL
recvfrom(0x16,0xefffcba4,0x3e8,0,0xefffcfc8,0xefffd01c)
1210 1 ntpd MISC msghdr: 28,
00000000f02cf7e0f25edeac00000001000000001039485c00000000
1210 1 ntpd GIO fd 22 read 12 bytes
"\^V\^A\0\^A\0\0\0\0\0\0\0\0"
1210 1 ntpd MISC sockname: 16, 1002e7210a00008d0000000000000000
1210 1 ntpd RET recvfrom 12/0xc, -268448860/0xefffcba4
1210 1 ntpd CALL
recvfrom(0x16,0xefffcba4,0x3e8,0,0xefffcfc8,0xefffd01c)
1210 1 ntpd MISC msghdr: 28,
00000000f02cf7e0f25edeac00000001000000001039485c00000000
1210 1 ntpd RET recvfrom -1 errno 35 Resource temporarily
unavailable
1210 1 ntpd CALL setcontext(0xefffd178)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGIO caught handler=0x6b670 mask=():
code=[1], fd=22, band=41)
1210 1 ntpd CALL clock_gettime(0,0xefffd0e8)
1210 1 ntpd RET clock_gettime 0, -268447512/0xefffd0e8
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=(23):
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffcec8)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd CALL select(0x1c,0xefffd05c,0,0,0xefffd0b4)
1210 1 ntpd RET select 1, -268447652/0xefffd05c
1210 1 ntpd CALL
recvfrom(0x16,0xefffcc74,0x3e8,0,0xefffd098,0xefffd0ec)
1210 1 ntpd MISC msghdr: 28,
00000000f02cf7e0f25edeac00000001000000000001b58400000000
1210 1 ntpd GIO fd 22 read 12 bytes
"\^V\^A\0\^A\0\0\0\0\0\0\0\0"
1210 1 ntpd MISC sockname: 16, 1002e7210a00008d0000000000000000
1210 1 ntpd RET recvfrom 12/0xc, -268448652/0xefffcc74
1210 1 ntpd CALL
recvfrom(0x16,0xefffcc74,0x3e8,0,0xefffd098,0xefffd0ec)
1210 1 ntpd MISC msghdr: 28,
00000000f02cf7e0f25edeac00000001000000000001b58400000000
1210 1 ntpd RET recvfrom -1 errno 35 Resource temporarily
unavailable
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd248)
1210 1 ntpd RET setcontext JUSTRETURN
1210 1 ntpd PSIG SIGALRM caught handler=0x3ea2c mask=():
code=SI_TIMER sigval 0x3)
1210 1 ntpd CALL setcontext(0xefffd178)
1210 1 ntpd RET setcontext JUSTRETURN
>

Dave Hart

unread,
Nov 4, 2011, 9:40:15 PM11/4/11
to
>From your netbsd.org mailing list traffic, I believe you're using
NetBSD 5.x. Looking at ntpd/ntp_io.c, recvfrom() is not the call I'd
expect to see happen, has NetBSD 5.x supports SO_TIMESTAMP, so #ifdef
HAVE_TIMESTAMP code is active, and ntpd would typically use recvmsg()
rather than recvfrom(). See read_network_packet() in ntpd/ntp_io.c I
say typically because if either

1. the particular local address ("interface") to which the socket is
bound is ignoring input (as ntpd's wildcard sockets do, and others can
be configured to do via "interface ___ drop" in ntp.conf), or
2. ntpd has no receive buffers available

then ntpd will use recvfrom() to a stack-based buffer (0xefffcc74
here) and discard the data so read. My hunch is ntpd is somehow
getting wedged during your cron jobs so that all receive buffers are
consumed and more can not be allocated. You can monitor the situation
using ntpq -c iostats on 4.2.7, or ntpdc -c iostats on earlier
versions. Pay attention to free receive buffers and dropped packets
(due to no buffer) in particular.

ntpd can't allocate more receive buffers safely when handling SIGIO.
That is done later, after the signal handler has returned, as a side
effect of pulling a "full" receive buffer from a list of previously
received packets for processing, if a packet had been dropped
previously due to lack of receive buffers. To debug if you've found a
corner case where that allocation code never gets called, i suggest
you try changing this code in libntp/recvbuff.c from:

isc_boolean_t has_full_recv_buffer(void)
{
if (HEAD_FIFO(full_recv_fifo) != NULL)
return (ISC_TRUE);
else
return (ISC_FALSE);
}

to

isc_boolean_t has_full_recv_buffer(void)
{
if (HEAD_FIFO(full_recv_fifo) != NULL)
return (ISC_TRUE);
else {
/* allocate more buffers if needed as side effect in
get_full_recv_buffer() (which will return NULL) */
get_full_recv_buffer();
return (ISC_FALSE);
}
}

If receive buffer shortfall is the event that triggers ntpd hanging,
you might be able to accelerate reproducing the problem by tweaking
these values in include/recvbuff.h:

/*
* recvbuf memory management
*/
#define RECV_INIT 10 /* 10 buffers initially */
#define RECV_LOWAT 3 /* when we're down to three buffers get more */
#define RECV_INC 5 /* get 5 more at a time */
#define RECV_TOOMANY 40 /* this is way too many buffers */

Good luck,
Dave Hart

A C

unread,
Nov 9, 2011, 3:44:10 PM11/9/11
to
I will make these changes later today but just for more reference,
here's an output of ntpdc -c iostats within a couple seconds of ntpd
locking up. The data was collected by a second machine polling the ntpd
server every two seconds and recording the output. The ntpd server was
also having its peer list polled by a second machine (same one as the
ntpdc) every ten seconds.

time since reset: 332
receive buffers: 10
free receive buffers: 9
used receive buffers: 0
low water refills: 1
dropped packets: 0
ignored packets: 2
received packets: 705
packets sent: 1159
packets not sent: 0
interrupts handled: 706
received by int: 704


It didn't take long to crash but I'm not able to catch the buffers in
the act of filling up. However, the increased poll rate does shorten
the life of ntpd's run time. Note this instance crashed after only 332
seconds. I was running ntpdc at a slower polling rate (every 5 seconds)
and it took about 17,000 seconds to crash in that case.

Dave Hart

unread,
Nov 9, 2011, 5:26:29 PM11/9/11
to
On Wed, Nov 9, 2011 at 20:44, A C <agcarv...@acarver.net> wrote:
> It didn't take long to crash but I'm not able to catch the buffers in the
> act of filling up.  However, the increased poll rate does shorten the life
> of ntpd's run time.  Note this instance crashed after only 332 seconds.  I
> was running ntpdc at a slower polling rate (every 5 seconds) and it took
> about 17,000 seconds to crash in that case.

I suspect to really understand what's going wrong, you'll need to run
ntpd interactively using -dd (or equivalently -D2) so that it traces
events to stdout. It's also possible using gdb without debug tracing
would allow enough poking around after the problem occurs to identify
it.

Cheers,
Dave Hart

A C

unread,
Nov 9, 2011, 6:56:49 PM11/9/11
to
I guess that's the next step. It crashed again just now even with the
code modification you suggested. In these last three cases none of the
cron jobs I mentioned some time earlier were running. The system was
idle yet ntpd crashed just the same.

I'll see how ntpd -D2 works. It shouldn't take long for the next crash.

A C

unread,
Nov 9, 2011, 11:21:04 PM11/9/11
to
On 11/9/2011 14:26, Dave Hart wrote:
As expected, it crashed again. The interesting thing is that the
constant output from -D2 stopped in its tracks. Nothing in the output
shows anything out of the ordinary. It just halts completely.

Dave Hart

unread,
Nov 9, 2011, 11:32:27 PM11/9/11
to
On Thu, Nov 10, 2011 at 04:21, A C <agcarv...@acarver.net> wrote:
> As expected, it crashed again.  The interesting thing is that the constant
> output from -D2 stopped in its tracks.  Nothing in the output shows anything
> out of the ordinary.  It just halts completely.

That's progress, and tells me a lot. We need to see where ntpd is
stuck, I think that means either attaching gdb to the hung ntpd, or
starting it again under gdb and breaking into gdb once it hangs. I'm
relatively weak on gdb, but I believe if it's started under gdb, ^C
will break into a gdb> prompt.

Once in, use "bt" to get a stack trace that hopefully goes back to
ntpdmain (that is, is hopefully on the correct ntpd thread, and not
any secondary or gdb-created thread).

Cheers,
Dave Hart

A C

unread,
Nov 10, 2011, 2:39:29 AM11/10/11
to
I'll restart it under gdb and see what happens when it hangs again.

I did have a question about the debug data out of -D2. What does the
"head" parameter indicate in a poll_update? I notice that most of the
time head is zero but once in a while it's something else. The log of
the debug data shows it's gone over 500 a couple times.

This one is zero, I lost the bigger one which
> poll_update: at 3966 127.127.22.1 poll 4 burst 0 retry 0 head 0 early 2 next 16

Dave Hart

unread,
Nov 10, 2011, 5:39:49 AM11/10/11
to
On Thu, Nov 10, 2011 at 07:39, A C <agcarv...@acarver.net> wrote:
> I did have a question about the debug data out of -D2.  What does the "head"
> parameter indicate in a poll_update?  I notice that most of the time head is
> zero but once in a while it's something else.  The log of the debug data
> shows it's gone over 500 a couple times.
>
> This one is zero, I lost the bigger one which
>>
>> poll_update: at 3966 127.127.22.1 poll 4 burst 0 retry 0 head 0 early 2
>> next 16

This is the transmit headway, reported by the "rv <associd>" or "rv
&1" (or &2 for second association, etc.) as headway=

See http://www.eecis.udel.edu/~mills/ntp/html/rate.html#guard for an
explanation. This is the output or transmit counter described.

Cheers,
Dave Hart

A C

unread,
Nov 11, 2011, 3:23:13 PM11/11/11
to
On 11/9/2011 20:32, Dave Hart wrote:
First attempt with gdb and a back trace after attaching gdb to the hung
process (note this particular running of ntpd was not using the debug
command line options):

> #0 0x103d1458 in .umul () from /usr/lib/libc.so.12
> #1 0x103c38d4 in __pow5mult_D2A () from /usr/lib/libc.so.12
> #2 0x103c3ac4 in __muldi3 () from /usr/lib/libc.so.12
> #3 0x103c34dc in __mult_D2A () from /usr/lib/libc.so.12
> #4 0x103c3728 in __pow5mult_D2A () from /usr/lib/libc.so.12
> #5 0x103b61d4 in __dtoa () from /usr/lib/libc.so.12
> #6 0x103b315c in __vfprintf_unlocked () from /usr/lib/libc.so.12
> #7 0x103230c4 in snprintf () from /usr/lib/libc.so.12
> #8 0x00023afc in ctl_putarray (tag=<value optimized out>, arr=0xa8fe0, start=1)
> at ntp_control.c:1307
> #9 0x00024a7c in ctl_putpeer (varid=30, peer=0xa8e70) at ntp_control.c:1777
> #10 0x0002744c in read_variables (rbufp=0x1050d000, restrict_mask=0) at ntp_control.c:2334
> #11 0x0002664c in process_control (rbufp=0x1050d000, restrict_mask=0) at ntp_control.c:809
> #12 0x00035594 in receive (rbufp=0x1050d000) at ntp_proto.c:370
> #13 0x00022c00 in ntpdmain (argc=<value optimized out>, argv=<value optimized out>) at ntpd.c:1150
> #14 0x0001381c in ___start ()
> #15 0x00013754 in _start ()

Dave Hart

unread,
Nov 11, 2011, 10:19:07 PM11/11/11
to
Excellent. I assume the stack trace is from ntpd 4.2.6p3. I think
you've found a bug in your system's libc dtoa() exposed by its
snprintf(s, " %.2f", ...). I believe you will not be able to
reproduce the bug using 4.2.7, as that version of ntpd uses
C99-snprintf [1] if the system snprintf() is not C99-compliant.
C99-snprintf's rpl_vsnprintf() does not use dtoa(), it hand-rolls the
double-to-ascii conversion. Below is the code in ntpd. NTP_SHIFT is
8. I claim the ntpd code is correct and your system's dtoa() and
thereby snprintf() of double (floating point) is subject to infinite
looping for some values.

I suggest we move this discussion to the appropriate NetBSD mailing
list. Please cc me, and I'll subscribe.

/*
* ctl_putarray - write a tagged eight element double array into the response
*/
static void
ctl_putarray(
const char *tag,
double *arr,
int start
)
{
register char *cp;
register const char *cq;
char buffer[200];
int i;
cp = buffer;
cq = tag;
while (*cq != '\0')
*cp++ = *cq++;
i = start;
do {
if (i == 0)
i = NTP_SHIFT;
i--;
NTP_INSIST((cp - buffer) < sizeof(buffer));
snprintf(cp, sizeof(buffer) - (cp - buffer),
" %.2f", arr[i] * 1e3);
cp += strlen(cp);
} while(i != start);
ctl_putdata(buffer, (unsigned)(cp - buffer), 0);
}

[1] http://www.jhweiss.de/software/snprintf.html

Cheers,
Dave Hart

unruh

unread,
Nov 12, 2011, 12:52:01 AM11/12/11
to
On 2011-11-12, Dave Hart <ha...@ntp.org> wrote:
> On Fri, Nov 11, 2011 at 20:23, A C <agcarv...@acarver.net> wrote:
>> First attempt with gdb and a back trace after attaching gdb to the hung
>> process (note this particular running of ntpd was not using the debug
>> command line options):
>>
>>> #0 ?0x103d1458 in .umul () from /usr/lib/libc.so.12
>>> #1 ?0x103c38d4 in __pow5mult_D2A () from /usr/lib/libc.so.12
>>> #2 ?0x103c3ac4 in __muldi3 () from /usr/lib/libc.so.12
>>> #3 ?0x103c34dc in __mult_D2A () from /usr/lib/libc.so.12
>>> #4 ?0x103c3728 in __pow5mult_D2A () from /usr/lib/libc.so.12
>>> #5 ?0x103b61d4 in __dtoa () from /usr/lib/libc.so.12
>>> #6 ?0x103b315c in __vfprintf_unlocked () from /usr/lib/libc.so.12
>>> #7 ?0x103230c4 in snprintf () from /usr/lib/libc.so.12
>>> #8 ?0x00023afc in ctl_putarray (tag=<value optimized out>, arr=0xa8fe0,
>>> start=1)
>>> ? ?at ntp_control.c:1307
>>> #9 ?0x00024a7c in ctl_putpeer (varid=30, peer=0xa8e70) at
>>> ntp_control.c:1777
>>> #10 0x0002744c in read_variables (rbufp=0x1050d000, restrict_mask=0) at
>>> ntp_control.c:2334
>>> #11 0x0002664c in process_control (rbufp=0x1050d000, restrict_mask=0) at
>>> ntp_control.c:809
>>> #12 0x00035594 in receive (rbufp=0x1050d000) at ntp_proto.c:370
>>> #13 0x00022c00 in ntpdmain (argc=<value optimized out>, argv=<value
>>> optimized out>) at ntpd.c:1150
>>> #14 0x0001381c in ___start ()
>>> #15 0x00013754 in _start ()
>
> Excellent. I assume the stack trace is from ntpd 4.2.6p3. I think
> you've found a bug in your system's libc dtoa() exposed by its
> snprintf(s, " %.2f", ...). I believe you will not be able to
> reproduce the bug using 4.2.7, as that version of ntpd uses
> C99-snprintf [1] if the system snprintf() is not C99-compliant.
> C99-snprintf's rpl_vsnprintf() does not use dtoa(), it hand-rolls the

That seems like a step backwards. Having a program write its own code
for such elementary routines is a recipie for choas. Rather, if there
is a bug in libc it should be fixed.

(Of course sometimes bugs are not fixed. An infinite loop in the
tcpwrapper routines was refused to be fixed by Verma-- there was a
workaround.
Why was that "roll your own dtoa" added to ntpd?

> double-to-ascii conversion. Below is the code in ntpd. NTP_SHIFT is
> 8. I claim the ntpd code is correct and your system's dtoa() and
> thereby snprintf() of double (floating point) is subject to infinite
> looping for some values.
>
> I suggest we move this discussion to the appropriate NetBSD mailing
> list. Please cc me, and I'll subscribe.

While adding that discussion would certainly be a good idea, until it is really
confirmed that it is a bug in dtoa it should also continue in the ntp
list as well.

Dave Hart

unread,
Nov 12, 2011, 2:41:53 AM11/12/11
to
On Sat, Nov 12, 2011 at 05:52, unruh <un...@chronyadvocacy.org> wrote:
> On 2011-11-12, Dave Hart <ha...@ntp.org> wrote:
>> Excellent.  I assume the stack trace is from ntpd 4.2.6p3.  I think
>> you've found a bug in your system's libc dtoa() exposed by its
>> snprintf(s, " %.2f", ...).  I believe you will not be able to
>> reproduce the bug using 4.2.7, as that version of ntpd uses
>> C99-snprintf [1] if the system snprintf() is not C99-compliant.
>> C99-snprintf's rpl_vsnprintf() does not use dtoa(), it hand-rolls the
>
> That seems like a step backwards. Having a program write its own code
> for  such elementary routines is a recipie for choas. Rather, if there
> is a bug  in libc it should be fixed.
> [...]
> Why was that "roll your own dtoa" added to ntpd?

It appears you misunderstood me. That's ok, I was confused myself, as
I'll mention in a separate followup. ntpd does not roll its own
dtoa(). ntpd does not roll its own [v]snprintf() either, it uses the
system/compiler provided snprintf() _unless_ that snprintf() is not
C99-compliant, in which case it uses a 3rd-party implementation of
[v]snprintf, namely, C99-snprintf (that's a proper name, as I hoped
the footnoted URL in my prior message would make clear). Why? ntpd
is highly portable code. It is not simply a component of Linux. It
is portable source code that can be built on any reasonably POSIX-like
system with an ANSI C compiler, as well as a couple of
not-particularly-POSIX-like targets, notably Windows. (ANSI standard
C, the first cross-vendor standard for C, is also confusingly referred
to variously as ISO C, C89, and C90 -- boy do I get sick of pointing
that out.) ANSI C is nearly universally supported, and has been for
years. ntpd 4.2.4 and earlier could be built on classic, pre-ANSI C
compilers, because until the last few years, there were still
interesting systems in wide use that didn't support ANSI C out of the
box, and requiring users to integrate gcc for their system would have
made it impractical to use up-to-date ntpd. Similarly, today, there
remain widely deployed systems which do not support newer C standards.
Portability to those systems mandates that if we want to rely on
improvements to the C runtime since ANSI, we must provide a fallback
implementation of the newer spec.

Why do we want a C99-compliant [v]snprintf()? Well, ANSI C snprintf()
is just horrible if the buffer is not large enough to hold the
requested output. If the requested output fits in the buffer without
zero termination, the return value is the number of characters output,
excluding zero termination. If the buffer has room for one additional
character, the string is zero terminated. In either case the return
value is the same. So, you may end up with an unterminated string
which will result in later buffer read overrun when it is used, and
you can't tell. But wait, there's more. If the output excluding
terminator does not fit in the given buffer, the return value is -1,
so the caller has no way to determine how large a buffer would be
needed. Recognizing the ANSI spec is disastrous and has led to
innumerable bugs and exploits, C99 changes snprintf() in a
backward-incompatible way, requiring (a) the output is always
terminated and (b) the return value is always the number of characters
that would be output (excluding terminator) if there were sufficient
room.

Using ANSI C snprintf() while not inviting bugs and exploits means
ensuring every use of snprintf() is followed by explicit code to set
the last character of the buffer to zero to terminate it, or
alternatively to audit every use of [v]snprintf() to ensure lack of
termination can not occur. Neither option is acceptable, so as with a
number of other less-than-universally-provided routines (MD5 hash,
strdup, strlcpy, strlcat), ntpd source includes fallback
implementations. System-provided routines are used if available,
avoiding binary bloat from duplication and limiting the use of
less-widely-tested code to the systems that need it.

C99-snprintf's rpl_vsnprintf() code (distributed in libntp/snprintf.c)
does not rely on any other C runtime functions such as dtoa(). If you
believe that to be a bad design, please provide a better [v]snprintf()
implementation under a BSD/MIT-style license that does not place
requirements on NTP's documentation, and I'll be happy to consider
using it for the fallback implementation used with Microsoft's C
runtime and others lacking C99 snprintf() semantics.

Cheers,
Dave Hart

Dave Hart

unread,
Nov 12, 2011, 3:01:14 AM11/12/11
to
On Sat, Nov 12, 2011 at 03:19, Dave Hart <ha...@ntp.org> wrote:
> Excellent.  I assume the stack trace is from ntpd 4.2.6p3.  I think
> you've found a bug in your system's libc dtoa() exposed by its
> snprintf(s, " %.2f", ...).  I believe you will not be able to
> reproduce the bug using 4.2.7, as that version of ntpd uses
> C99-snprintf [1] if the system snprintf() is not C99-compliant.
> C99-snprintf's rpl_vsnprintf() does not use dtoa(), it hand-rolls the
> double-to-ascii conversion.  Below is the code in ntpd.  NTP_SHIFT is
> 8.  I claim the ntpd code is correct and your system's dtoa() and
> thereby snprintf() of double (floating point) is subject to infinite
> looping for some values.

I was confused. Thinking of the antique hardware involved, I was
assuming wrongly that the system-provided snprintf() was not
C99-compliant. In fact, the antique is running modern NetBSD 5.x,
which undoubtedly does provide a C99 snprintf().

But wait, there's more. I was also probably wrong in saying that this
is a bug in 32-bit SPARC NetBSD 5.x dtoa(). That code is likely quite
mature and such a bug would have been found before now. My hunch is a
hardware bug, but I'll pursue that on the port-...@netbsd.org list.
I will report back here once it's resolved, if Mr. Carver does not.

Cheers,
Dave Hart

A C

unread,
Nov 12, 2011, 3:08:14 AM11/12/11
to
I'll gladly report back to the list if the bug is discovered wherever it
might reside. However, if there's a hardware bug affecting dtoa() then
it should manifest in the gpsd code that is also running on the system
and also using snprintf(). The first question is to determine the
conditions that cause a runaway dtoa() since it seems to only affect one
program and not the other.

Dave Hart

unread,
Nov 17, 2011, 3:42:00 AM11/17/11
to
For the original poster, this step backwards worked around the
problem. For the benefit of threaded archives, acarver reported in
another thread [1][2][3] that using ntpd 4.2.7 configured with
--enable-c99-snprintf to force the use of C99-snprintf's replacement
code has made the problem disappear, apparently confirming the problem
is in the NetBSD sparc32 dtoa().

Agreed the bug in the C runtime should be fixed. It turned out having
an alternative implementation of snprintf() easily available made it
easy to confirm the bug is in the C runtime, though.

[1] http://lists.ntp.org/pipermail/questions/2011-November/030983.html
-or-
[2] http://groups.google.com/group/comp.protocols.time.ntp/msg/8e18eba15c70db7f
-or-
[3] comp.protocols.time.ntp Message-ID: <4EC4B838...@acarver.net>

Cheers,
Dave Hart
Reply all
Reply to author
Forward
0 new messages