All my servers and workstations running a 2.6.21.5 kernel hung exactly
when the date shifted from June 30th to July 1st.
On my monitoring system every single station running a 2.6.21.5 kernel
stopped responding right after midnight on the date shift from June
30th to July 1st. Stations still running 2.6.18 to 2.6.20.11, however,
worked flawlessly.
I first thought there had been a power outage, but two of my
servers (Dell PE 2950, dual quad-core) on UPS in our server room also
hung:
Jun 30 23:55:01 urpdev1 /USR/SBIN/CRON[31298]: (root) CMD ([ -x
/usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; [
"$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1; })
Jul 3 11:54:03 urpdev1 syslogd 1.4.1#17: restart.
I could not get anything on any of the 20+ consoles... All the systems
hung at around the same time... when the date shifted from June
30th to July 1st in UTC...?
Any clue, anyone?
- vin
Forgot to mention:
- All stations that failed were running a 2.6.21 kernel + CFS v18 (I don't have any stations running a plain 2.6.21 kernel, so I can't tell)
- Config file can be found at: http://linux-dev.qc.ec.gc.ca/kernel/debian/CONFIG-i686-2.6.21-005
- kernels can be found at: http://linux-dev.qc.ec.gc.ca/kernel/debian/sarge/i686/2.6.21/
Fortier,Vincent [Montreal] wrote:
> Any clue any one?
No problems over here with plain 2.6.21.5 on x686
You could just reset the date back on one of these machines and
check the transition again to see if it was really the kernel
that crashed. Also check your cron configuration.
Regards,
--
Clemens Koller
__________________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Straße 45/1
Linhof Werksgelände
D-81379 München
Tel.089-741518-50
Fax 089-741518-19
http://www.anagramm-technology.com
regards,
Uli
--
------- ROAD ...the handyPC Company - - - ) ) )
Uli Luckas
Software Development
ROAD GmbH
Bennigsenstr. 14 | 12159 Berlin | Germany
fon: +49 (30) 230069 - 64 | fax: +49 (30) 230069 - 69
url: www.road.de
Amtsgericht Charlottenburg: HRB 96688 B
Managing directors: Hans-Peter Constien, Hubertus von Streit
I tried reverting to 23:50 June 30th and it did not hang when switching to July 1st. I've tried it a few times without any problems. So I deactivated all NTP-related jobs, switched the date back to June 30th 23:10 and rebooted the system. Wait and see.
> From: linux-ker...@vger.kernel.org
> [mailto:linux-ker...@vger.kernel.org] On behalf of Uli Luckas
>
> Same thing here on two machines with plain vanilla
> 2.6.21.(3/4), on debian testing & debian unstable.
>
I am also using Debian, but Sarge 3.1. There might be a connection there, unless somebody reports the same problem on another distribution. At least it's not CFS related.
> -----Original Message-----
> From: linux-ker...@vger.kernel.org
> [mailto:linux-ker...@vger.kernel.org] On behalf of
> Arne Georg Gleditsch
>
> Florian Attenberger <val...@gmail.com> writes:
> > there was one 'special' event at that date:
> > syslog.2.gz:Jul 1 01:59:59 master kernel: Clock: inserting leap
> > second 23:59:60 UTC
>
> As far as I can tell, no leap second was due to be inserted
> at 1. of July this year. Is the year set correctly for this box?
>
All my server/workstations are in sync using ntp... And yes, the year is set properly on all of them.
Regards,
- vin
I'm not all that versed in ntp-ish, but it appears that the leap
second insertion should be propagated through the ntp protocol.
Whether the leap second in question came from an ntp server giving out
wrong data or from a misinterpretation or bug in ntpd is of course
hard to say, but either way turning the clock back is unlikely to
reconstruct the circumstances. An interesting exercise might be to
code up a small program to call adjtimex with timex.status |= STA_INS,
to see if this can trigger the problem. (The bogus leap second might
be a red herring entirely, of course...)
--
Arne.
You are probably right, I did try to reproduce the problem without
success...
Although it is weird that it happened only on 2.6.21 kernels... It did
not happen on any of my workstations/servers running either 2.6.18 or
2.6.20.
Could dynticks be involved?
- vin
Interesting, I just sent out an email for a similar issue, but with a pair of
2.6.10 machines.
I'm wondering if it's related to a spurious leap second event...
Chris
Just wondering, what is your distribution?
- vin
>>Interesting, I just sent out an email for a similar issue,
>>but with a pair of 2.6.10 machines.
>>
>>I'm wondering if its related to a spurious leap second event...
> Just wondering, what is your distribution?
We're based off a WindRiver PNE-LE distribution.
Chris
I saw it on a box that happened to have lockdep enabled.
(I run it everywhere thankfully). This is what it looked like..
http://www.codemonkey.org.uk/junk/img_0421.jpg
Dave
--
http://www.codemonkey.org.uk
I'm trying to get a console on the affected system to query the leap second info
from the ntp servers.
However, just for kicks I queried the local servers for my desktop following the
instructions that I found on a thread about spurious leap second notifications.
Interestingly, two of the associations show non-zero leap values...
Chris
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -cas zcars0vr
ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 62124  b614   yes   yes  none  sys.peer   reachable  1
  2 62125  b4f4   yes   yes  none  candidat   reachable 15
  3 62126  b314   yes   yes  none   outlyer   reachable  1
  4 62127  b314   yes   yes  none   outlyer   reachable  1
  5 62128  8000   yes   yes  none    reject
  6 62129  b434   yes   yes  none  candidat   reachable  3
  7 62130  b424   yes   yes  none  candidat   reachable  2
  8 62131  a0f3   yes   yes  none    reject  lost reach 15
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62124 leap" zcars0vr
assID=62124 status=b614 reach, conf, sel_sys.peer, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62125 leap" zcars0vr
assID=62125 status=b0f4 reach, conf, 15 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62126 leap" zcars0vr
assID=62126 status=b314 reach, conf, sel_outlyer, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62127 leap" zcars0vr
assID=62127 status=b414 reach, conf, sel_candidat, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62128 leap" zcars0vr
assID=62128 status=8000 unreach, conf, no events,
leap=11
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62129 leap" zcars0vr
assID=62129 status=b434 reach, conf, sel_candidat, 3 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62130 leap" zcars0vr
assID=62130 status=b424 reach, conf, sel_candidat, 2 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62131 leap" zcars0vr
assID=62131 status=a0f3 unreach, conf, 15 events, event_unreach,
leap=11
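For readers decoding the output above: the `leap` variable is the two-bit leap indicator carried in NTP packets, which ntpq prints in binary. A minimal sketch of a decoder, using the meanings given in the NTP specification (the function name `ntp_leap_meaning` is made up for this example and is not part of ntpq):

```c
#include <string.h>

/* Hypothetical helper: map ntpq's two-character "leap=" value to its meaning. */
const char *ntp_leap_meaning(const char *bits)
{
    if (strcmp(bits, "00") == 0)
        return "no warning";
    if (strcmp(bits, "01") == 0)
        return "last minute of the day has 61 seconds (insert)";
    if (strcmp(bits, "10") == 0)
        return "last minute of the day has 59 seconds (delete)";
    if (strcmp(bits, "11") == 0)
        return "alarm: clock not synchronized";
    return "not a valid leap indicator";
}
```

Read this way, the two `leap=11` associations above (62128 and 62131) are the unreachable ones, and `11` reports an unsynchronized peer rather than a pending leap second; an announced insertion would appear as `leap=01`.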
> An interesting exercise might be to
> code up a small program to call adjtimex with timex.status |= STA_INS,
> to see if this can trigger the problem.
Setting the date to just before midnight June 30 UTC and then running the
following as root triggered the crash on a modified 2.6.10. Anyone see anything
wrong with the code below, or is this a valid indication of a bug in the leap
second code?
Chris
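#include <sys/timex.h>
#include <stdio.h>
#include <errno.h>

struct timex buf;	/* static storage, so zero-initialized: modes == 0 */

int main(void)
{
	/* modes == 0: read the current clock state without changing it */
	int rc = adjtimex(&buf);
	if (rc == -1) {
		printf("unable to read status: %m\n");
		return -1;
	}
	printf("initial status: 0x%x\n", buf.status);

	/* ask the kernel to insert a leap second at the end of the UTC day */
	buf.status |= STA_INS;
	buf.modes = ADJ_STATUS;
	rc = adjtimex(&buf);
	if (rc == -1) {
		printf("unable to set status: %m\n");
		return -1;
	} else
		printf("rc: %d\n", rc);

	printf("final status: 0x%x\n", buf.status);
	return 0;
}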
> Setting the date to just before midnight June 30 UTC and then running
> the following as root triggered the crash on a modified 2.6.10. Anyone
> see anything wrong with the code below, or is this a valid indication of
> a bug in the leap second code?
As a further data point, the test app triggers problems on x86 uniprocessor and
SMP as well as arm uniprocessor. On ppc64 we see the leap second being added,
but it doesn't hang, while on ppc we don't even see the leap second being
added--leading me to wonder if the leap second code even works for ppc32.
The above is all for modified 2.6.10.
Chris
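For anyone who wants to check whether a box currently has a leap second armed without risking the hang, a read-only variant of the test program above might look like the sketch below. It calls adjtimex() with modes == 0, which only reads the state and does not need root. The helper names `leap_flag_name` and `print_leap_state` are invented for this example.

```c
#include <stdio.h>
#include <sys/timex.h>

/* Hypothetical helper: describe the leap-second flags in a timex status word. */
const char *leap_flag_name(int status)
{
    if ((status & STA_INS) && (status & STA_DEL))
        return "both (inconsistent)";
    if (status & STA_INS)
        return "insert pending";
    if (status & STA_DEL)
        return "delete pending";
    return "none";
}

/* Read the kernel clock status without modifying it (modes == 0).
 * Returns the adjtimex() clock state (TIME_OK, TIME_INS, ...) or -1 on error. */
int print_leap_state(void)
{
    struct timex tx = { .modes = 0 };
    int state = adjtimex(&tx);

    if (state == -1) {
        perror("adjtimex");
        return -1;
    }
    printf("status 0x%x, leap flag: %s, clock state: %d%s\n",
           tx.status, leap_flag_name(tx.status), state,
           state == TIME_INS ? " (TIME_INS)" : "");
    return state;
}
```

Calling `print_leap_state()` from a small `main()` on an affected machine before midnight UTC should show whether STA_INS has been set. Clearing the flag would presumably be the same ADJ_STATUS call as in the test program with the bit masked out, though a running ntpd may simply set it again.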
Thanx a lot! This was fast! (beat that closed source!)
- vin
regards,
Uli
Yes, it has already been sent to -stable.
thanks,
-chris
Chris Wright wrote:
> * Uli Luckas (u.lu...@road.de) wrote:
>> On Tuesday, 3. July 2007, Chuck Ebbert wrote:
>>> On 07/03/2007 03:28 PM, Chris Friesen wrote:
>>>> Arne Georg Gleditsch wrote:
>>>>> An interesting exercise might be to
>>>>> code up a small program to call adjtimex with timex.status |= STA_INS,
>>>>> to see if this can trigger the problem.
>>>> Setting the date to just before midnight June 30 UTC and then running
>>>> the following as root triggered the crash on a modified 2.6.10. Anyone
>>>> see anything wrong with the code below, or is this a valid indication of
>>>> a bug in the leap second code?
>>> Fixed:
>>> http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b
>>>
>> Hi Chris,
>> does that qualify for inclusion into 2.6.21.6?
>
> Yes, it has already been sent to -stable.
Okay, we all survived Y2K and this little glitch. Phew! ;-)
Can you please explain in which configuration this problem got triggered?
Would it make sense to have some testing environments with the date
set about one month in the future to catch crashes like this one
before they take down machines in production?
Best regards,
--
Clemens Koller
> Okay, we all survived Y2K and this little glitch. Puh! ;-)
> Can you please explain in which configuration this problem got triggered.
As far as I can tell, many kernel versions contained the source code bug.
(I'd like some more information on exactly what the problem was, if
anyone cares to share; the proposed patch didn't give much in the way of
specifics.)
However, in order to trigger the problem you also need to have NTP
servers that were erroneously broadcasting the addition of a leap second.
So most people didn't see the issue because there wasn't supposed to be
a leap second added this year...but they would have seen it the next
time a leap second was added.
Chris
No matter what NTP servers send, it shouldn't result in a DoS.
> So most people didn't see the issue because there wasn't supposed to be
> a leap second added this year...but they would have seen it the next
> time a leap second was added.
True. It seems like we will have another one next year.
Regards,
--
Clemens Koller
It only happens with CONFIG_HIGH_RES_TIMERS=y; otherwise clock_was_set()
is a NOP. So only the 2.6.21 kernel on i386 and ARM is affected.
tglx
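To illustrate why the config option matters, here is a simplified user-space sketch of the pattern described above, where a function compiles down to an empty inline stub unless a feature is switched on at build time. This is not the kernel source: `DEMO_HIGH_RES_TIMERS`, `demo_clock_was_set()`, `demo_insert_leap_second()` and the counter are stand-ins invented for the example.

```c
/* Build with -DDEMO_HIGH_RES_TIMERS to get the "real" variant. */
static int retrigger_count;     /* stand-in for the cross-CPU work */

#ifdef DEMO_HIGH_RES_TIMERS
static void demo_clock_was_set(void)
{
    retrigger_count++;          /* the risky path exists only in this build */
}
#else
static inline void demo_clock_was_set(void) { }    /* no-op stub */
#endif

/* Stand-in for the code path that inserts the leap second. */
int demo_insert_leap_second(void)
{
    demo_clock_was_set();       /* same call site in both builds */
    return retrigger_count;
}
```

With the option off, the call site is still there but does nothing, which would explain why the same leap-second code is harmless on kernels built without high resolution timers.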
> It only happens with CONFIG_HIGHRES_TIMERS=y otherwise clock_was_set()
> is a NOP. So only the 2.6.21 kernel and i386 and ARM are affected.
Are you certain?
Vanilla 2.6.10 shows a clock_was_set() function. Does it just not call
the dangerous code or something?
Also, our modified 2.6.10 has the high res timers patch applied, but the
config option is turned off and we were still affected.
Chris
At least for anything >= 2.6.16
> Vanilla 2.6.10 shows a clock_was_set() function. Does it just not call
> the dangerous code or something?
Ouch, the old posix timer code might be affected as well, but I did not
look.
> Also, our modified 2.6.10 has the high res timers patch applied, but the
> config option is turned off and we were still affected.
You mean Anzinger's high res patches? No idea about those.
tglx
> Clemens Koller wrote:
>
> > Okay, we all survived Y2K and this little glitch. Puh! ;-)
> > Can you please explain in which configuration this problem got triggered.
>
> As far as I can tell many kernel versions contained the source code bug.
> (I'd like some more information on exactly what the problem was if
> anyone cares to share..the proposed patch didn't give much in the way of
> specifics.)
>
> However, in order to trigger the problem you also need to have NTP
> servers that were erroneously broadcasting the addition of a leap second.
>
> So most people didn't see the issue because there wasn't supposed to be
> a leap second added this year...but they would have seen it the next
> time a leap second was added.
Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
vulnerable.
Cheers. -ernie
> Ernie Petrides wrote:
>
> > Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
> > vulnerable.
>
> As I mentioned in my post to Thomas, we have high res timers disabled
> and were still affected. Granted, our kernel has been modified so it is
> possible that vanilla would not be affected....I haven't tested it.
>
> Chris
That's odd, because Thomas's patch removed two calls to clock_was_set(),
which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
the 2.6.21 source tree).
Also, I personally tested with the reproducer you posted here, initially
on a box running 2.6.22-rc4, and there were no problems (but I'm not sure
what config options were enabled on that kernel). I did reproduce the
problem on a stock 2.6.21 kernel with CONFIG_HIGH_RES_TIMERS enabled.
> That's odd, because Thomas's patch removed two calls to clock_was_set(),
> which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
> the 2.6.21 source tree).
I'm using a modified 2.6.10 tree...I expect the timer code is different.
Chris
Way different and you have extra patches on top.
tglx
It needs a running smp_call_function() to be interrupted by the timer
interrupt, which calls clock_was_set(). So it's not that easy to
reproduce.
tglx
> It needs a running smp_call_function() to be interrupted by the timer
> interrupt, which calls clock_was_set(). So it's not that easy to
> reproduce.
On our 2.6.10-based kernel it's basically trivial to reproduce, and the
posted fix doesn't solve the issue.
One of our guys is trying to track it down. As yet we don't know if it's
the vanilla code or the patches on top that contain the bug.
Chris
> On Thu, 2007-07-05 at 19:12 -0400, Ernie Petrides wrote:
> > On Thursday, 5-Jul-2007 at 16:49 MDT, Chris Friesen wrote:
> >
> > > Ernie Petrides wrote:
> > >
> > > > Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
> > > > vulnerable.
> > >
> > > As I mentioned in my post to Thomas, we have high res timers disabled
> > > and were still affected. Granted, our kernel has been modified so it is
> > > possible that vanilla would not be affected....I haven't tested it.
> > >
> > > Chris
> >
> > That's odd, because Thomas's patch removed two calls to clock_was_set(),
> > which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
> > the 2.6.21 source tree).
> >
> > Also, I personally tested with the reproducer you posted here, initially
> > on a box running 2.6.22-rc4, and there were no problems (but I'm not sure
> > what config options were enabled on that kernel). I did reproduce the
> > problem on a stock 2.6.21 kernel with CONFIG_HIGH_RES_TIMERS enabled.
>
> It needs a running smp_call_function() to be interrupted by the timer
> interrupt, which calls clock_was_set(). So it's not that easy to
> reproduce.
I think it's reproducible at will when CONFIG_BUG is enabled, because the
WARN_ON() on line 546 of arch/i386/kernel/smp.c fires in smp_call_function(),
causing lots of console output. By the time on_each_cpu() later reenables
interrupts, another clock interrupt is pending, and (I think) causes a
self-deadlock on the xtime_lock in vmi_account_real_cycles().
That's my (unproven) theory, anyway. :)
At any rate, I reproduced it twice in two tries on stock 2.6.21.
Cheers. -ernie
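The general mechanism behind the theory above, code that already holds a non-recursive lock being interrupted by a handler that wants the same lock on the same CPU, can be shown with a toy user-space lock. This is only a sketch of the idea, not the kernel's xtime_lock or interrupt code; every name here is invented, and a try-lock is used so that the example returns instead of hanging.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy non-recursive lock; NOT the kernel's xtime_lock implementation. */
static atomic_flag toy_lock = ATOMIC_FLAG_INIT;

bool toy_trylock(void) { return !atomic_flag_test_and_set(&toy_lock); }
void toy_unlock(void)  { atomic_flag_clear(&toy_lock); }

/* Stand-in for a timer interrupt handler that needs the lock. */
bool toy_timer_interrupt(void)
{
    if (!toy_trylock())
        return false;       /* a real spinning lock would hang right here */
    toy_unlock();
    return true;
}

/* The outer handler holds the lock, then a nested "interrupt" arrives on
 * the same CPU before the lock is released. Returns whether the nested
 * handler could make progress. */
bool nested_interrupt_makes_progress(void)
{
    bool ok;

    toy_trylock();                  /* outer handler takes the lock */
    ok = toy_timer_interrupt();     /* interrupts re-enabled too early */
    toy_unlock();
    return ok;
}
```

Here `nested_interrupt_makes_progress()` returns false: the nested handler can never get the lock, because its holder is the very context it interrupted. With a real spinning lock that is a hard hang with no output, which would match the symptoms reported in this thread.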