2.6.21.5 june 30th to july 1st date hang?

Fortier,Vincent [Montreal]

Jul 3, 2007, 8:44:48 AM
to Linux Kernel Mailing List
Hi all,

All my servers and workstations running a 2.6.21.5 kernel hung exactly
when the date shifted from June 30th to July 1st.

On my monitoring system, every single station running a 2.6.21.5 kernel
stopped responding exactly after midnight on the date shift from June
30th to July 1st. Stations still running 2.6.18 to 2.6.20.11, however,
worked flawlessly.

I first thought there had been a power outage, but two of my
servers (Dell PE 2950, dual quad-core) on UPS in our server room also
hung:
Jun 30 23:55:01 urpdev1 /USR/SBIN/CRON[31298]: (root) CMD ([ -x
/usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; [
"$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1; })
Jul 3 11:54:03 urpdev1 syslogd 1.4.1#17: restart.

I could not get anything on any of the 20+ consoles... All the systems
hung at around the same time... When the date shifted from June
30th to July 1st in UTC ...?

Any clue, anyone?

- vin

Fortier,Vincent [Montreal]

Jul 3, 2007, 8:55:53 AM
to Linux Kernel Mailing List
> -----Original Message-----
> From: linux-ker...@vger.kernel.org
> [mailto:linux-ker...@vger.kernel.org] On behalf of
> Fortier,Vincent [Montreal]
> Sent: July 3, 2007 08:44

>
> Hi all,
>
> All my servers and workstations running a 2.6.21.5 kernel
> hanged exactly when the date shift from june 30th to july 1st.
>
> On my monitoring system every single station running a
> 2.6.21.5 kernel stoped responding exactly after midnight on
> the date shift from June 30th to July 1st. Although,
> stations still running 2.6.18 to 2.6.20.11 worked flawlessly.
>
> I first tought there had been an electricity outage but two
> of my servers (dell PE 2950 dual-quad core) on UPS in our
> server room also
> hanged:
> Jun 30 23:55:01 urpdev1 /USR/SBIN/CRON[31298]: (root) CMD ([ -x
> /usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT"
> ; [ "$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1; })
> Jul 3 11:54:03 urpdev1 syslogd 1.4.1#17: restart.
>
> I could not get anything on any of the 20+ consoles... All
> the systems hanged at around the exact same time... When the
> date shifted from June 30th to July 1st in UTC ...?
>
> Any clue any one?

Forgot to mention:

- All stations that failed were running a 2.6.21 kernel + CFS v18 (I don't have any stations running a plain 2.6.21 kernel, so I can't tell whether it is affected)
- Config file can be found at: http://linux-dev.qc.ec.gc.ca/kernel/debian/CONFIG-i686-2.6.21-005
- Kernels can be found at: http://linux-dev.qc.ec.gc.ca/kernel/debian/sarge/i686/2.6.21/

Clemens Koller

Jul 3, 2007, 9:06:02 AM
to Fortier,Vincent [Montreal], Linux Kernel Mailing List
Hi, Vincent!

Fortier,Vincent [Montreal] wrote:


> Hi all,
>
> All my servers and workstations running a 2.6.21.5 kernel hanged exactly
> when the date shift from june 30th to july 1st.
>
> On my monitoring system every single station running a 2.6.21.5 kernel
> stoped responding exactly after midnight on the date shift from June
> 30th to July 1st. Although, stations still running 2.6.18 to 2.6.20.11
> worked flawlessly.
>
> I first tought there had been an electricity outage but two of my
> servers (dell PE 2950 dual-quad core) on UPS in our server room also
> hanged:
> Jun 30 23:55:01 urpdev1 /USR/SBIN/CRON[31298]: (root) CMD ([ -x
> /usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; [
> "$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1; })
> Jul 3 11:54:03 urpdev1 syslogd 1.4.1#17: restart.
>
> I could not get anything on any of the 20+ consoles... All the systems
> hanged at around the exact same time... When the date shifted from June
> 30th to July 1st in UTC ...?
>
> Any clue any one?

No problems over here with plain 2.6.21.5 on x686

You could just reset the date back on one of these machines and
check the transition again... and see if it was really the kernel
that crashed... and check your cron configuration.

Regards,
--
Clemens Koller
__________________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Straße 45/1
Linhof Werksgelände
D-81379 München
Tel.089-741518-50
Fax 089-741518-19
http://www.anagramm-technology.com

Uli Luckas

Jul 3, 2007, 9:56:51 AM
to LKML, Fortier,Vincent [Montreal]
On Tuesday, 3. July 2007, Fortier,Vincent [Montreal] wrote:
> Hi all,
>
> All my servers and workstations running a 2.6.21.5 kernel hanged exactly
> when the date shift from june 30th to july 1st.
>
Same thing here on two machines with plain vanilla 2.6.21.(3/4), on debian
testing & debian unstable.

regards,
Uli

--

------- ROAD ...the handyPC Company - - - ) ) )

Uli Luckas
Software Development

ROAD GmbH
Bennigsenstr. 14 | 12159 Berlin | Germany
fon: +49 (30) 230069 - 64 | fax: +49 (30) 230069 - 69
url: www.road.de

Amtsgericht Charlottenburg: HRB 96688 B
Managing directors: Hans-Peter Constien, Hubertus von Streit

Florian Attenberger

Jul 3, 2007, 10:00:12 AM
to linux-...@vger.kernel.org
there was one 'special' event at that date:
syslog.2.gz:Jul 1 01:59:59 master kernel: Clock: inserting leap second
23:59:60 UTC

Fortier,Vincent [Montreal]

Jul 3, 2007, 10:52:46 AM
to Clemens Koller, Linux Kernel Mailing List, Uli Luckas, Florian Attenberger, Arne Georg Gleditsch
> -----Original Message-----
> From: Clemens Koller [mailto:clemens...@anagramm.de]
> Sent: July 3, 2007 09:05

>
> Hi, Vincent!
>
> Fortier,Vincent [Montreal] wrote:
> > Hi all,
> >
> > All my servers and workstations running a 2.6.21.5 kernel hanged
> > exactly when the date shift from june 30th to july 1st.
> >
> > I could not get anything on any of the 20+ consoles... All the
> > systems hanged at around the exact same time... When the date shifted
> > from June 30th to July 1st in UTC ...?
> >
> > Any clue any one?
>
> No problems over here with plain 2.6.21.5 on x686
>
> You could just reset the date back on one of these machines
> and check the transition again... and see if it was really
> the kernel who crashed.... and check your cron configuration.

I tried reverting to 23:50 June 30th and it did not hang when switching to July 1st. I've tried it a few times without any problems. So I deactivated all NTP-related jobs, switched the date back to June 30th 23:10 and rebooted the system. Wait and see.

> From: linux-ker...@vger.kernel.org
> [mailto:linux-ker...@vger.kernel.org] On behalf of Uli Luckas


>
> Same thing here on two machines with plain vanilla
> 2.6.21.(3/4), on debian testing & debian unstable.
>

I am also using Debian, but Sarge 3.1. There might be a relation there, unless somebody reports the same problem on another distribution. At least it's not CFS related.

> -----Original Message-----
> From: linux-ker...@vger.kernel.org
> [mailto:linux-ker...@vger.kernel.org] On behalf of
> Arne Georg Gleditsch


>
> Florian Attenberger <val...@gmail.com> writes:
> > there was one 'special' event at that date:
> > syslog.2.gz:Jul 1 01:59:59 master kernel: Clock: inserting leap
> > second 23:59:60 UTC
>

> As far as I can tell, no leap second was due to be inserted
> at 1. of July this year. Is the year set correctly for this box?
>

All my server/workstations are in sync using ntp... And yes, the year is set properly on all of them.

Regards,

- vin

Florian Attenberger

Jul 3, 2007, 11:02:46 AM
to Arne Georg Gleditsch, linux-...@vger.kernel.org
On Tue, Jul 03, 2007 at 04:20:17PM +0200, Arne Georg Gleditsch wrote:
> Florian Attenberger <val...@gmail.com> writes:
> > there was one 'special' event at that date:
> > syslog.2.gz:Jul 1 01:59:59 master kernel: Clock: inserting leap second
> > 23:59:60 UTC
>
> As far as I can tell, no leap second was due to be inserted at 1. of
> July this year. Is the year set correctly for this box?
>
yep, controlled by ntpd.
You're right according to
ftp://hpiers.obspm.fr/iers/bul/bulc/bulletinc.33
that event shouldn't have been there.

Arne Georg Gleditsch

Jul 3, 2007, 11:26:52 AM
to Florian Attenberger, linux-...@vger.kernel.org
Florian Attenberger <val...@gmail.com> writes:
> yep, controlled by ntpd.
> You're right according to
> ftp://hpiers.obspm.fr/iers/bul/bulc/bulletinc.33
> that event shouldn't have been there.

I'm not all that versed in ntp-ish, but it appears that the leap
second insertion should be propagated through the ntp protocol.
Whether the leap second in question came from a ntp server giving out
wrong data or from a misinterpretation or bug in ntpd is of course
hard to say, but either way turning the clock back is unlikely to
reconstruct the circumstances. An interesting exercise might be to
code up a small program to call adjtimex with timex.status |= STA_INS,
to see if this can trigger the problem. (The bogus leap second might
be a red herring entirely, of course...)

--
Arne.

Fortier,Vincent [Montreal]

Jul 3, 2007, 11:37:41 AM
to Arne Georg Gleditsch, Florian Attenberger, linux-...@vger.kernel.org
> -----Original Message-----
> From: linux-ker...@vger.kernel.org
> [mailto:linux-ker...@vger.kernel.org] On behalf of
> Arne Georg Gleditsch
>
> Florian Attenberger <val...@gmail.com> writes:
> > yep, controlled by ntpd.
> > You're right according to
> > ftp://hpiers.obspm.fr/iers/bul/bulc/bulletinc.33
> > that event shouldn't have been there.
>
> I'm not all that versed in ntp-ish, but it appears that the
> leap second insertion should be propagated through the ntp protocol.
> Whether the leap second in question came from a ntp server
> giving out wrong data or from a misinterpretation or bug in
> ntpd is of course hard to say, but either way turning the
> clock back is unlikely to reconstruct the circumstances. An
> interesting exercise might be to code up a small program to
> call adjtimex with timex.status |= STA_INS, to see if this
> can trigger the problem. (The bogus leap second might be a
> red herring entirely, of course...)

You are probably right, I did try to reproduce the problem without
success...

Although it is weird that it happened only on 2.6.21 kernels... It did
not happen on any of my workstations/servers running either 2.6.18 or
2.6.20.

Could dynticks be involved?

- vin

Chris Friesen

Jul 3, 2007, 11:59:34 AM
to Fortier,Vincent [Montreal], Linux Kernel Mailing List
Fortier,Vincent [Montreal] wrote:
> Hi all,
>
> All my servers and workstations running a 2.6.21.5 kernel hanged exactly
> when the date shift from june 30th to july 1st.

Interesting, I just sent out an email for a similar issue, but with a pair of
2.6.10 machines.

I'm wondering if it's related to a spurious leap second event...

Chris

Fortier,Vincent [Montreal]

Jul 3, 2007, 12:02:58 PM
to Chris Friesen, Linux Kernel Mailing List
> -----Original Message-----
> From: linux-ker...@vger.kernel.org
> [mailto:linux-ker...@vger.kernel.org] On behalf of
> Chris Friesen

>
> Fortier,Vincent [Montreal] wrote:
> > Hi all,
> >
> > All my servers and workstations running a 2.6.21.5 kernel hanged
> > exactly when the date shift from june 30th to july 1st.
>
> Interesting, I just sent out an email for a similar issue,
> but with a pair of 2.6.10 machines.
>
> I'm wondering if its related to a spurious leap second event...
>

Just wondering, what is your distribution?

- vin

Chris Friesen

Jul 3, 2007, 12:05:10 PM
to Fortier,Vincent [Montreal], Linux Kernel Mailing List
Fortier,Vincent [Montreal] wrote:
>>-----Original Message-----
>>From: linux-ker...@vger.kernel.org
>>[mailto:linux-ker...@vger.kernel.org] On behalf of
>>Chris Friesen

>>Interesting, I just sent out an email for a similar issue,

>>but with a pair of 2.6.10 machines.
>>
>>I'm wondering if its related to a spurious leap second event...

> Just wondering, what is your distribution?

We're based off a WindRiver PNE-LE distribution.

Chris

Dave Jones

Jul 3, 2007, 1:19:49 PM
to Fortier,Vincent [Montreal], Arne Georg Gleditsch, Florian Attenberger, linux-...@vger.kernel.org

I saw it on a box that happened to have lockdep enabled.
(I run it everywhere thankfully). This is what it looked like..
http://www.codemonkey.org.uk/junk/img_0421.jpg

Dave

--
http://www.codemonkey.org.uk

Chris Friesen

Jul 3, 2007, 1:28:56 PM
to Fortier,Vincent [Montreal], Linux Kernel Mailing List
Some more information....

I'm trying to get a console on the affected system to query the leap second info
from the ntp servers.

However, just for kicks I queried the local servers for my desktop following the
instructions that I found on a thread about spurious leap second notifications.
Interestingly, two of the associations show non-zero leap values...

Chris

[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -cas zcars0vr

ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 62124  b614   yes   yes  none  sys.peer   reachable   1
  2 62125  b4f4   yes   yes  none  candidat   reachable  15
  3 62126  b314   yes   yes  none   outlyer   reachable   1
  4 62127  b314   yes   yes  none   outlyer   reachable   1
  5 62128  8000   yes   yes  none    reject
  6 62129  b434   yes   yes  none  candidat   reachable   3
  7 62130  b424   yes   yes  none  candidat   reachable   2
  8 62131  a0f3   yes   yes  none    reject  lost reach  15


[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62124 leap" zcars0vr
assID=62124 status=b614 reach, conf, sel_sys.peer, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62125 leap" zcars0vr
assID=62125 status=b0f4 reach, conf, 15 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62126 leap" zcars0vr
assID=62126 status=b314 reach, conf, sel_outlyer, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62127 leap" zcars0vr
assID=62127 status=b414 reach, conf, sel_candidat, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62128 leap" zcars0vr
assID=62128 status=8000 unreach, conf, no events,
leap=11
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62129 leap" zcars0vr
assID=62129 status=b434 reach, conf, sel_candidat, 3 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62130 leap" zcars0vr
assID=62130 status=b424 reach, conf, sel_candidat, 2 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62131 leap" zcars0vr
assID=62131 status=a0f3 unreach, conf, 15 events, event_unreach,
leap=11

Chris Friesen

Jul 3, 2007, 3:29:06 PM
to Arne Georg Gleditsch, Florian Attenberger, linux-...@vger.kernel.org
Arne Georg Gleditsch wrote:

> An interesting exercise might be to
> code up a small program to call adjtimex with timex.status |= STA_INS,
> to see if this can trigger the problem.

Setting the date to just before midnight June 30 UTC and then running the
following as root triggered the crash on a modified 2.6.10. Anyone see anything
wrong with the code below, or is this a valid indication of a bug in the leap
second code?

Chris


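#include <sys/timex.h>
#include <stdio.h>
#include <errno.h>

struct timex buf;   /* static storage, so zero-initialized: modes == 0 */

int main(void)
{
        /* The first call, with modes == 0, only reads the current state. */
        int rc = adjtimex(&buf);
        if (rc == -1) {
                printf("unable to read status: %m\n");
                return -1;
        }
        printf("initial status: 0x%x\n", buf.status);

        /* Ask the kernel to insert a leap second at the end of the UTC day. */
        buf.status |= STA_INS;
        buf.modes = ADJ_STATUS;
        rc = adjtimex(&buf);
        if (rc == -1) {
                printf("unable to set status: %m\n");
                return -1;
        }
        printf("rc: %d\n", rc);
        printf("final status: 0x%x\n", buf.status);
        return 0;
}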

Chris Friesen

Jul 3, 2007, 5:03:15 PM
to Arne Georg Gleditsch, Florian Attenberger, linux-...@vger.kernel.org
Chris Friesen wrote:

> Setting the date to just before midnight June 30 UTC and then running
> the following as root triggered the crash on a modified 2.6.10. Anyone
> see anything wrong with the code below, or is this a valid indication of
> a bug in the leap second code?


As a further data point, the test app triggers problems on x86 uniprocessor and
SMP as well as arm uniprocessor. On ppc64 we see the leap second being added,
but it doesn't hang, while on ppc we don't even see the leap second being
added--leading me to wonder if the leap second code even works for ppc32.

The above is all for modified 2.6.10.

Chris

Chuck Ebbert

Jul 3, 2007, 5:03:28 PM
to Chris Friesen, Arne Georg Gleditsch, Florian Attenberger, linux-...@vger.kernel.org
On 07/03/2007 03:28 PM, Chris Friesen wrote:
> Arne Georg Gleditsch wrote:
>
>> An interesting exercise might be to
>> code up a small program to call adjtimex with timex.status |= STA_INS,
>> to see if this can trigger the problem.
>
> Setting the date to just before midnight June 30 UTC and then running
> the following as root triggered the crash on a modified 2.6.10. Anyone
> see anything wrong with the code below, or is this a valid indication of
> a bug in the leap second code?
>

Fixed:
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b

Fortier,Vincent [Montreal]

Jul 3, 2007, 9:06:56 PM
to Chuck Ebbert, Chris Friesen, Arne Georg Gleditsch, Florian Attenberger, linux-...@vger.kernel.org
> -----Original Message-----
> From: linux-ker...@vger.kernel.org
> [mailto:linux-ker...@vger.kernel.org] On behalf of Chuck Ebbert
> Sent: July 3, 2007 17:03

>
> On 07/03/2007 03:28 PM, Chris Friesen wrote:
> > Arne Georg Gleditsch wrote:
> >
> >> An interesting exercise might be to
> >> code up a small program to call adjtimex with timex.status |=
> >> STA_INS, to see if this can trigger the problem.
> >
> > Setting the date to just before midnight June 30 UTC and then running
> > the following as root triggered the crash on a modified 2.6.10.
> > Anyone see anything wrong with the code below, or is this a valid
> > indication of a bug in the leap second code?
> >
>
> Fixed:
> http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b

Thanx a lot! This was fast! (beat that closed source!)

- vin

Uli Luckas

Jul 4, 2007, 4:57:23 AM
to Chris Wright, LKML
On Tuesday, 3. July 2007, Chuck Ebbert wrote:
> On 07/03/2007 03:28 PM, Chris Friesen wrote:
> > Arne Georg Gleditsch wrote:
> >> An interesting exercise might be to
> >> code up a small program to call adjtimex with timex.status |= STA_INS,
> >> to see if this can trigger the problem.
> >
> > Setting the date to just before midnight June 30 UTC and then running
> > the following as root triggered the crash on a modified 2.6.10. Anyone
> > see anything wrong with the code below, or is this a valid indication of
> > a bug in the leap second code?
>
> Fixed:
> http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b
>
Hi Chris,
does that qualify for inclusion into 2.6.21.6?

regards,
Uli


Chris Wright

Jul 4, 2007, 12:53:58 PM
to Uli Luckas, Chris Wright, LKML
* Uli Luckas (u.lu...@road.de) wrote:
> On Tuesday, 3. July 2007, Chuck Ebbert wrote:
> > On 07/03/2007 03:28 PM, Chris Friesen wrote:
> > > Arne Georg Gleditsch wrote:
> > >> An interesting exercise might be to
> > >> code up a small program to call adjtimex with timex.status |= STA_INS,
> > >> to see if this can trigger the problem.
> > >
> > > Setting the date to just before midnight June 30 UTC and then running
> > > the following as root triggered the crash on a modified 2.6.10. Anyone
> > > see anything wrong with the code below, or is this a valid indication of
> > > a bug in the leap second code?
> >
> > Fixed:
> > http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b
> >
> Hi Chris,
> does that qualify for inclusion into 2.6.21.6?

Yes, it has already been sent to -stable.

thanks,
-chris

Clemens Koller

Jul 5, 2007, 10:14:41 AM
to Chris Wright, Uli Luckas, LKML
Hello, Chris!

Chris Wright wrote:


> * Uli Luckas (u.lu...@road.de) wrote:
>> On Tuesday, 3. July 2007, Chuck Ebbert wrote:
>>> On 07/03/2007 03:28 PM, Chris Friesen wrote:
>>>> Arne Georg Gleditsch wrote:
>>>>> An interesting exercise might be to
>>>>> code up a small program to call adjtimex with timex.status |= STA_INS,
>>>>> to see if this can trigger the problem.
>>>> Setting the date to just before midnight June 30 UTC and then running
>>>> the following as root triggered the crash on a modified 2.6.10. Anyone
>>>> see anything wrong with the code below, or is this a valid indication of
>>>> a bug in the leap second code?
>>> Fixed:
>>> http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b
>>>
>> Hi Chris,
>> does that qualify for inclusion into 2.6.21.6?
>
> Yes, it has already been sent to -stable.

Okay, we all survived Y2K and this little glitch. Phew! ;-)
Can you please explain in which configuration this problem got triggered?

Does it make sense to have some testing environments which have the date
set to about one month in the future to catch any crashes like that,
preventing machines in production from failing?!

Best regards,


--
Clemens Koller

Chris Friesen

Jul 5, 2007, 1:49:01 PM
to Clemens Koller, Chris Wright, Uli Luckas, LKML
Clemens Koller wrote:

> Okay, we all survived Y2K and this little glitch. Puh! ;-)
> Can you please explain in which configuration this problem got triggered.

As far as I can tell many kernel versions contained the source code bug.
(I'd like some more information on exactly what the problem was if
anyone cares to share..the proposed patch didn't give much in the way of
specifics.)

However, in order to trigger the problem you also need to have NTP
servers that were erroneously broadcasting the addition of a leap second.

So most people didn't see the issue because there wasn't supposed to be
a leap second added this year...but they would have seen it the next
time a leap second was added.

Chris

Clemens Koller

Jul 5, 2007, 2:34:50 PM
to Chris Friesen, Chris Wright, Uli Luckas, LKML
Chris Friesen wrote:

> However, in order to trigger the problem you also need to have NTP
> servers that were erroneously broadcasting the addition of a leap second.

No matter what NTP servers send, it shouldn't result in a DoS.

> So most people didn't see the issue because there wasn't supposed to be
> a leap second added this year...but they would have seen it the next
> time a leap second was added.

True. It seems like we will have another one next year.

Regards,


--
Clemens Koller

Thomas Gleixner

Jul 5, 2007, 4:10:46 PM
to Chris Friesen, Clemens Koller, Chris Wright, Uli Luckas, LKML
On Thu, 2007-07-05 at 11:48 -0600, Chris Friesen wrote:
> Clemens Koller wrote:
>
> > Okay, we all survived Y2K and this little glitch. Puh! ;-)
> > Can you please explain in which configuration this problem got triggered.
>
> As far as I can tell many kernel versions contained the source code bug.
> (I'd like some more information on exactly what the problem was if
> anyone cares to share..the proposed patch didn't give much in the way of
> specifics.)

It only happens with CONFIG_HIGH_RES_TIMERS=y; otherwise clock_was_set()
is a NOP. So only 2.6.21 kernels on i386 and ARM are affected.

tglx

Chris Friesen

Jul 5, 2007, 5:03:29 PM
to Thomas Gleixner, Clemens Koller, Chris Wright, Uli Luckas, LKML
Thomas Gleixner wrote:

> It only happens with CONFIG_HIGHRES_TIMERS=y otherwise clock_was_set()
> is a NOP. So only the 2.6.21 kernel and i386 and ARM are affected.

Are you certain?

Vanilla 2.6.10 shows a clock_was_set() function. Does it just not call
the dangerous code or something?

Also, our modified 2.6.10 has the high res timers patch applied, but the
config option is turned off and we were still affected.

Chris

Thomas Gleixner

Jul 5, 2007, 5:18:30 PM
to Chris Friesen, Clemens Koller, Chris Wright, Uli Luckas, LKML
On Thu, 2007-07-05 at 15:02 -0600, Chris Friesen wrote:
> Thomas Gleixner wrote:
>
> > It only happens with CONFIG_HIGHRES_TIMERS=y otherwise clock_was_set()
> > is a NOP. So only the 2.6.21 kernel and i386 and ARM are affected.
>
> Are you certain?

At least for anything >= 2.6.16

> Vanilla 2.6.10 shows a clock_was_set() function. Does it just not call
> the dangerous code or something?

Ouch, the old posix timer code might be affected as well, but I did not
look.

> Also, our modified 2.6.10 has the high res timers patch applied, but the
> config option is turned off and we were still affected.

You mean Anzinger's high res patches? No idea about those.

tglx

Ernie Petrides

Jul 5, 2007, 6:30:38 PM
to Chris Friesen, Clemens Koller, Chris Wright, Uli Luckas, LKML
On Thursday, 5-Jul-2007 at 11:48 MDT, "Chris Friesen" wrote:

> Clemens Koller wrote:
>
> > Okay, we all survived Y2K and this little glitch. Puh! ;-)
> > Can you please explain in which configuration this problem got triggered.
>
> As far as I can tell many kernel versions contained the source code bug.
> (I'd like some more information on exactly what the problem was if
> anyone cares to share..the proposed patch didn't give much in the way of
> specifics.)
>
> However, in order to trigger the problem you also need to have NTP
> servers that were erroneously broadcasting the addition of a leap second.
>
> So most people didn't see the issue because there wasn't supposed to be
> a leap second added this year...but they would have seen it the next
> time a leap second was added.

Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
vulnerable.

Cheers. -ernie

Ernie Petrides

Jul 5, 2007, 7:15:14 PM
to Chris Friesen, Thomas Gleixner, Clemens Koller, Chris Wright, Uli Luckas, LKML
On Thursday, 5-Jul-2007 at 16:49 MDT, Chris Friesen wrote:

> Ernie Petrides wrote:
>
> > Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
> > vulnerable.
>

> As I mentioned in my post to Thomas, we have high res timers disabled
> and were still affected. Granted, our kernel has been modified so it is
> possible that vanilla would not be affected....I haven't tested it.
>
> Chris

That's odd, because Thomas's patch removed two calls to clock_was_set(),
which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
the 2.6.21 source tree).

Also, I personally tested with the reproducer you posted here, initially
on a box running 2.6.22-rc4, and there were no problems (but I'm not sure
what config options were enabled on that kernel). I did reproduce the
problem on a stock 2.6.21 kernel with CONFIG_HIGH_RES_TIMERS enabled.

Chris Friesen

Jul 5, 2007, 7:46:00 PM
to Ernie Petrides, Thomas Gleixner, Clemens Koller, Chris Wright, Uli Luckas, LKML
Ernie Petrides wrote:

> That's odd, because Thomas's patch removed two calls to clock_was_set(),
> which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
> the 2.6.21 source tree).

I'm using a modified 2.6.10 tree...I expect the timer code is different.

Chris

Thomas Gleixner

Jul 6, 2007, 1:16:33 AM
to Chris Friesen, Ernie Petrides, Clemens Koller, Chris Wright, Uli Luckas, LKML
On Thu, 2007-07-05 at 17:45 -0600, Chris Friesen wrote:
> Ernie Petrides wrote:
>
> > That's odd, because Thomas's patch removed two calls to clock_was_set(),
> > which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
> > the 2.6.21 source tree).
>
> I'm using a modified 2.6.10 tree...I expect the timer code is different.

Way different and you have extra patches on top.

tglx

Thomas Gleixner

Jul 6, 2007, 1:17:59 AM
to Ernie Petrides, Chris Friesen, Clemens Koller, Chris Wright, Uli Luckas, LKML
On Thu, 2007-07-05 at 19:12 -0400, Ernie Petrides wrote:
> On Thursday, 5-Jul-2007 at 16:49 MDT, Chris Friesen wrote:
>
> > Ernie Petrides wrote:
> >
> > > Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
> > > vulnerable.
> >
> > As I mentioned in my post to Thomas, we have high res timers disabled
> > and were still affected. Granted, our kernel has been modified so it is
> > possible that vanilla would not be affected....I haven't tested it.
> >
> > Chris
>
> That's odd, because Thomas's patch removed two calls to clock_was_set(),
> which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
> the 2.6.21 source tree).
>
> Also, I personally tested with the reproducer you posted here, initially
> on a box running 2.6.22-rc4, and there were no problems (but I'm not sure
> what config options were enabled on that kernel). I did reproduce the
> problem on a stock 2.6.21 kernel with CONFIG_HIGH_RES_TIMERS enabled.

It needs a running smp_call_function() to be interrupted by the timer
interrupt, which calls clock_was_set(). So it's not that easy to
reproduce.

tglx

Chris Friesen

Jul 6, 2007, 11:49:42 AM
to Thomas Gleixner, Ernie Petrides, Clemens Koller, Chris Wright, Uli Luckas, LKML
Thomas Gleixner wrote:

> It needs a running smp_call_function() to be interrupted by the timer
> interrupt, which calls clock_was_set(). So it's not that easy to
> reproduce.

On our 2.6.10-based kernel it's basically trivial to reproduce, and the
posted fix doesn't solve the issue.

One of our guys is trying to track it down. As yet we don't know if it's
the vanilla code or the patches on top that contain the bug.

Chris

Ernie Petrides

Jul 6, 2007, 4:05:37 PM
to Thomas Gleixner, Chris Friesen, Clemens Koller, Chris Wright, Uli Luckas, LKML
On Friday, 6-Jul-2007 at 7:17 +0200, Thomas Gleixner wrote:

> On Thu, 2007-07-05 at 19:12 -0400, Ernie Petrides wrote:
> > On Thursday, 5-Jul-2007 at 16:49 MDT, Chris Friesen wrote:
> >
> > > Ernie Petrides wrote:
> > >
> > > > Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
> > > > vulnerable.
> > >
> > > As I mentioned in my post to Thomas, we have high res timers disabled
> > > and were still affected. Granted, our kernel has been modified so it is
> > > possible that vanilla would not be affected....I haven't tested it.
> > >
> > > Chris
> >
> > That's odd, because Thomas's patch removed two calls to clock_was_set(),
> > which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
> > the 2.6.21 source tree).
> >
> > Also, I personally tested with the reproducer you posted here, initially
> > on a box running 2.6.22-rc4, and there were no problems (but I'm not sure
> > what config options were enabled on that kernel). I did reproduce the
> > problem on a stock 2.6.21 kernel with CONFIG_HIGH_RES_TIMERS enabled.
>
> It needs a running smp_call_function() to be interrupted by the timer
> interrupt, which calls clock_was_set(). So it's not that easy to
> reproduce.

I think it's reproducible at will when CONFIG_BUG is enabled, because the
WARN_ON() on line 546 of arch/i386/kernel/smp.c fires in smp_call_function(),
causing lots of console output. By the time on_each_cpu() later reenables
interrupts, another clock interrupt is pending, and (I think) causes a
self-deadlock on the xtime_lock in vmi_account_real_cycles().

That's my (unproven) theory, anyway. :)

At any rate, I reproduced it twice in two tries on stock 2.6.21.

Cheers. -ernie
