Re: FreeBSD handles leapsecond correctly

Carlos Amengual

Jan 4, 2006, 12:25:32 AM

John-Mark Gurney wrote:
>
> Having talked with a sextant user... leap seconds don't matter to
> sextant users... They have a book of times and positions, and the
> book is only good for a year or two... so needs to be reprinted, and
> on reprints, they can take care of the leap second issue...

Those tables are valid for a year (you can use computer programs too).
If you use software, you have to fix the software, and if you use tables
then a new set of tables has to be produced, where the time argument is
not Universal Time (as until now), but Dynamical Time (TDT), or maybe
something in the middle: "Dynamical Time plus something".

> so, the only fix for a sextant is a new table of numbers... don't
> forget the tables HAVE to be updated for precisely the reason leap
> seconds are inserted...

It is not only an update, it is a change in the tables themselves. And
sidereal time would no longer be sidereal time.

The current nautical tables use UT and not TDT as time argument
precisely to avoid confusing people with timescales. Astronomical tables
already use TDT (except for sidereal time), which is the kind of
timescale that you want (TDT is based on TAI).

Regards,
Carlos Amengual
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-curre...@freebsd.org"

John-Mark Gurney

Jan 4, 2006, 12:41:46 AM

Carlos Amengual wrote this message on Wed, Jan 04, 2006 at 01:24 +0100:

> John-Mark Gurney wrote:
> >so, the only fix for a sextant is a new table of numbers... don't
> >forget the tables HAVE to be updated for precisely the reason leap
> >seconds are inserted...
>
> It is not only an update, it is a change in the tables themselves. And
> sidereal time would no longer be sidereal time.

I mean how they are currently updated each year, not updated to use a
new time scale..

--
John-Mark Gurney Voice: +1 415 225 5579

"All that I will do, has been done, All that I have, has not."

Carlos Amengual

Jan 4, 2006, 12:58:34 AM

John-Mark Gurney wrote:
> Carlos Amengual wrote this message on Wed, Jan 04, 2006 at 01:24 +0100:
>
>>John-Mark Gurney wrote:
>>
>>>so, the only fix for a sextant is a new table of numbers... don't
>>>forget the tables HAVE to be updated for precisely the reason leap
>>>seconds are inserted...
>>
>>It is not only an update, it is a change in the tables themselves. And
>>sidereal time would no longer be sidereal time.
>
>
> I mean how they are currently updated each year, not updated to use a
> new time scale..
>

Dropping leap seconds would lead to the use of a timescale different to
UT in tables.

Of course, a point could be reached where leap seconds are frequent
enough to become a pain for computers. In that case, computers could use
TDT as time scale, but we need to have UT one way or the other.

Astronomers regularly use TDT (formerly ET) as a work timescale, and UT
only for Earth-orientation matters (Sidereal Time, which in fact defines
UT). Computers could use TDT (not TAI, which AFAIK is not in "real"
practical use elsewhere), but simply "freezing" UTC with the drop of
leap seconds just adds confusion (another TAI-based timescale in
addition to TAI itself and TDT). A "frozen" UT in current use is already
invented: TDT.
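
A minimal sketch of how these timescales relate numerically, assuming the standard definition TT = TAI + 32.184 s (TT is the current name for TDT) and a TAI - UTC offset of 33 s from 2006-01-01. The function name is illustrative only.

```c
/* Offsets between the timescales discussed above, in seconds.
 * TT (formerly TDT) is defined as TAI + 32.184 s, so it never
 * takes leap seconds.  TAI - UTC grows by one with each leap
 * second inserted into UTC. */
#define TT_MINUS_TAI 32.184

/* tai_minus_utc: accumulated leap-second offset, supplied by the
 * caller (33 s as of 2006-01-01). */
double tt_minus_utc(int tai_minus_utc)
{
    return TT_MINUS_TAI + (double)tai_minus_utc;
}
```

With tai_minus_utc = 33 this gives 65.184 s, and the value only changes when a new leap second is announced.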

Regards,
Carlos Amengual

Rich Wales

Jan 4, 2006, 2:05:42 AM

Peter Jeremy wrote:

> The Islamic calendar (at least the consensus from some googling)
> appears to have 11 leap years in a 30 (lunar) year cycle . . .

AFAIK, no, this is not correct. The Islamic calendar is a 12-month,
strictly lunar (=not= lunisolar) calendar. The Islamic year is 11
days shorter than a solar year, and there is =no= correction for
this discrepancy, so any given Islamic month/day will drift through
the seasons over a period of about 33 years.

This is why, for example, the Islamic month of fasting (Ramadan) is
a little earlier every year (same lunar month, but different solar
calendar dates).
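
A rough sketch of the arithmetic behind the "11 days" and "about 33 years" figures, assuming mean values for the synodic month and the tropical year. Real months are set by observation, so this is only an average.

```c
/* Mean astronomical lengths, in days. */
#define SYNODIC_MONTH_DAYS 29.530588   /* mean new moon to new moon */
#define TROPICAL_YEAR_DAYS 365.2422    /* mean solar (seasonal) year */

/* Length of a year of twelve lunar months. */
double lunar_year_days(void)
{
    return 12.0 * SYNODIC_MONTH_DAYS;
}

/* How many days earlier a lunar date falls each solar year. */
double drift_days_per_year(void)
{
    return TROPICAL_YEAR_DAYS - lunar_year_days();
}

/* Solar years needed for a lunar date to drift through all seasons. */
double drift_cycle_years(void)
{
    return TROPICAL_YEAR_DAYS / drift_days_per_year();
}
```

With these mean values the drift is a little under 11 days per year and the full cycle takes between 33 and 34 solar years.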

> . . . and it seems that there isn't even general agreement on
> which years are leap years.

As I said, there's no such thing as a "leap year" in the Islamic
calendar. Were you perhaps reading about how the start of each
month (and, thus, the length of each month) is traditionally
determined by actual observation of the young crescent moon?

Rich Wales
ri...@richw.org

Matthew Dillon

Jan 6, 2006, 6:37:15 AM

:Luigi Rizzo wrote:
:> On Sun, Jan 01, 2006 at 10:59:14AM +0100, Poul-Henning Kamp wrote:
:>>http://phk.freebsd.dk/misc/leapsecond.txt
:>>
:>>Notice how CLOCK_REALTIME recycles the 1136073599 second.
:>
:> on a related topic, any comments on this one ?
:> Is this code that we could use ?
:>
:> http://www.dragonflybsd.org/docs/nanosleep/
:
:I ported the tvtohz change from Dragonfly back to 4.10 and 5-STABLE here:
:
:http://www.pkix.net/~chuck/timer/
:
:...so anyone who wants to experiment can try it out. :-)
:
:--
:-Chuck

It isn't so much tvtohz that's the issue, but the fact that the
nanosleep() system call has really coarse hz-based resolution. That's
been fixed in DragonFly and I would recommend that it be fixed in
FreeBSD too. After all, there isn't much of a point having a system
call called 'nanosleep' whose time resolution is coarse-grained and
non-deterministic from computer to computer (based on how hz was
configured).
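
One way to see the coarseness from userland is to time a short sleep. The sketch below uses only POSIX nanosleep() and clock_gettime(); the function name is made up for illustration, and the result depends entirely on the kernel and its hz setting.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Sleep for request_ns nanoseconds and return how many nanoseconds
 * longer than requested the sleep actually took. */
long long nanosleep_overshoot_ns(long request_ns)
{
    struct timespec req = { request_ns / 1000000000L,
                            request_ns % 1000000000L };
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* If a signal interrupts the sleep, resume with the time left. */
    while (nanosleep(&req, &req) == -1 && errno == EINTR)
        ;
    clock_gettime(CLOCK_MONOTONIC, &end);

    long long elapsed = (end.tv_sec - start.tv_sec) * 1000000000LL
                      + (end.tv_nsec - start.tv_nsec);
    return elapsed - request_ns;
}
```

On a kernel whose sleep resolution is tied to hz=100, a 1 ms request would be expected to overshoot by milliseconds rather than microseconds.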

Since you seem to be depending on fine-resolution timers more and
more in recent kernels, you should consider porting our SYSTIMER API
to virtualize one-shot and periodic-timers. Look at kern/kern_systimer.c
in the DragonFly source. The code is fairly well abstracted, operates
on a per-cpu basis, and even though you don't have generic IPI messaging
I think you could port it without too much trouble.

If you port it and start using it you will quickly find that you can't
live without it. e.g. take a look at how we implement network POLLING for
an example of its use. The polling rate can be set to anything at
any time, regardless of 'hz'. Same goes for interrupt rate limiting,
various scheduler timers, and a number of other things. All the things
that should be divorced from 'hz' have been.

For people worried about edge conditions due to multiple unsynchronized
timers going off I will note that it's never been an issue for us, and
in any case it's fairly trivial to adjust the systimer code to synchronize
periodic time bases which run at integer multiples to timeout at the
same time. Most periodic time bases tend to operate in this fashion
(the stat clock being the only notable exception) so full efficiency
can be retained. But, as I said, I've actually done that and not
noticed any significant improvement in performance so I just don't bother
right now.
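
The synchronization idea can be sketched as follows. This is a hypothetical helper, not code from kern_systimer.c: it simply rounds each timer's next expiry up to a multiple of its period, so that a timer whose period is an integer multiple of another's always fires on one of the other's expiry times.

```c
#include <stdint.h>

/* Return the first expiry strictly after `now` that falls on a
 * multiple of `period`.  Both values use the same time unit
 * (for example microseconds since boot). */
uint64_t next_aligned_expiry(uint64_t now, uint64_t period)
{
    return (now / period + 1) * period;
}
```

For example, at now = 1234 a period-100 timer would next fire at 1300 and a period-200 timer at 1400, which is also an expiry time of the period-100 timer.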

-Matt
Matthew Dillon
<dil...@backplane.com>

Chuck Swiger

Jan 6, 2006, 5:22:13 PM

Matthew Dillon wrote:
>: Luigi Rizzo wrote:
[ ... ]

>:> on a related topic, any comments on this one ?
>:> Is this code that we could use ?
>:>
>:> http://www.dragonflybsd.org/docs/nanosleep/
>:
>:I ported the tvtohz change from Dragonfly back to 4.10 and 5-STABLE here:
>:
>:http://www.pkix.net/~chuck/timer/
>:
>:...so anyone who wants to experiment can try it out. :-)

[ ... ]

>
> It isn't so much tvtohz that's the issue, but the fact that the
> nanosleep() system call has really coarse hz-based resolution. That's
> been fixed in DragonFly and I would recommend that it be fixed in
> FreeBSD too. After all, there isn't much of a point having a system
> call called 'nanosleep' whose time resolution is coarse-grained and
> non-deterministic from computer to computer (based on how hz was
> configured).

Agreed. The changes I'd copied back basically correct the issue of nanosleep()
waiting an extra tick, so they cut the average latency seen by something trying
to wake up at an exact point from about 1.5 ticks (1.5/HZ seconds) to about
0.5 ticks, but they don't give it finer-grained timing.
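
A simplified model of where those two figures come from, under the assumption that wakeups only happen on tick boundaries and that a sleep starts at a uniformly random point within a tick. It is not the actual tvtohz() code. If the kernel waits for the requested number of ticks plus `extra_ticks` boundaries, the first boundary arrives half a tick after the call on average, so the mean overshoot is extra_ticks - 0.5 ticks.

```c
/* Average overshoot, in microseconds, of a tick-driven sleep that
 * waits for `extra_ticks` tick boundaries beyond the requested
 * duration, with `hz` ticks per second.  Assumes the sleep starts
 * at a uniformly random phase within a tick. */
double avg_overshoot_us(int extra_ticks, int hz)
{
    double tick_us = 1e6 / hz;

    return (extra_ticks - 0.5) * tick_us;
}
```

In this model, with hz=100 two extra ticks give an average overshoot of 15 ms and one extra tick gives 5 ms. Raising hz shrinks both, but the granularity stays tied to the tick.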

> Since you seem to be depending on fine-resolution timers more and
> more in recent kernels, you should consider porting our SYSTIMER API
> to virtualize one-shot and periodic-timers. Look at kern/kern_systimer.c
> in the DragonFly source. The code is fairly well abstracted, operates
> on a per-cpu basis, and even though you don't have generic IPI messaging
> I think you could port it without too much trouble.
>
> If you port it and start using it you will quickly find that you can't
> live without it. e.g. take a look at how we implement network POLLING for
> an example of its use. The polling rate can be set to anything at
> any time, regardless of 'hz'. Same goes for interrupt rate limiting,
> various scheduler timers, and a number of other things. All the things
> that should be divorced from 'hz' have been.

Out of curiosity, what is DragonFly doing with the network timing counters (i.e.,
TCPOPT_TIMESTAMP and the stuff in <netinet/tcp_timer.h>), has that been
separated from HZ too?

I'm pretty sure that setting:

#define TCPTV_MSL ( 30*hz) /* max seg lifetime (hah!) */

...with HZ=1000 or more is not entirely correct. :-) Not when it started with
the TTL in hops being equated to one hop per second...

--
-Chuck

Matthew Dillon

Jan 6, 2006, 6:25:11 PM

:Out of curiosity, what is DragonFly doing with the network timing counters (i.e.,
:TCPOPT_TIMESTAMP and the stuff in <netinet/tcp_timer.h>), has that been
:separated from HZ too?
:
:I'm pretty sure that setting:
:
:#define TCPTV_MSL ( 30*hz) /* max seg lifetime (hah!) */
:
:...with HZ=1000 or more is not entirely correct. :-) Not when it started with
:the TTL in hops being equated to one hop per second...
:
:--
:-Chuck

Well, you know what they say... if it ain't broke, don't fix it. In this
case the network stacks use that wonderful callwheel code that was
written years ago (in FreeBSD). SYSTIMERS aren't designed to handle
billions of timers like the callwheel code is so it wouldn't be a
proper application.

The one change I made to the callwheel code was to make it per-cpu in
order to guarantee that e.g. a device driver that installs an interrupt
and a callout would get both on the same cpu and thus be able to use
normal critical sections to interlock between them. This is a
particularly important aspect of our lockless per-cpu tcp protocol
threads. DragonFly's crit_enter()/crit_exit() together only take 9ns
(with INVARIANTS turned on), whereas the minimum non-contended inline
mutex (lwkt_serialize_enter()/exit()) takes around 20ns.

I don't know what edge cases exist when 'hz' is set so high. Since we
don't use hz for things that would normally require it to be set to a
high frequency, we just leave hz set to 100.

--

One side note. I've found both our userland (traditional bsd4) and
our LWKT scheduler to be really finicky about being properly woken
up via AST when a reschedule is required. Preemption by <<non-interrupt>>
threads is not beneficial at all since most kernel ops take < 1uS to
execute major operations. 'hz' is not relevant because it only affects
processes operating in batch. But 'forgetting' to queue an AST to
reschedule a thread ASAP (without preempting) when you are supposed
to can result in terrible interactive response because you have
processes winding up using their whole quantum before they realize
that they should have rescheduled. I've managed to break this three
times over the years in DragonFly... stupid things like forgetting a
crit_exit() or clearing the reschedule bit without actually rescheduling
or doing the wrong check in doreti(), etc. The bugs often went unnoticed
for weeks because they only showed up when someone did some heavily
cpu-bound work or test. It is the A#1 problem that you have to look
for if you have scheduler issues. All non-interrupt-thread preemption
accomplishes is to blow up your caches and prevent you from being able
to aggregate work between threads (which could be especially important
since your I/O is threaded in FreeBSD).

-Matt
Matthew Dillon
<dil...@backplane.com>

Andre Oppermann

Jan 7, 2006, 4:26:50 PM

Matthew Dillon wrote:
> :Luigi Rizzo wrote:

> :> On Sun, Jan 01, 2006 at 10:59:14AM +0100, Poul-Henning Kamp wrote:
> :>>http://phk.freebsd.dk/misc/leapsecond.txt
> :>>
> :>>Notice how CLOCK_REALTIME recycles the 1136073599 second.
> :>

> :> on a related topic, any comments on this one ?
> :> Is this code that we could use ?
> :>
> :> http://www.dragonflybsd.org/docs/nanosleep/
> :
> :I ported the tvtohz change from Dragonfly back to 4.10 and 5-STABLE here:
> :
> :http://www.pkix.net/~chuck/timer/
> :
> :...so anyone who wants to experiment can try it out. :-)

> :
> :--
> :-Chuck

>
> It isn't so much tvtohz that's the issue, but the fact that the
> nanosleep() system call has really coarse hz-based resolution. That's
> been fixed in DragonFly and I would recommend that it be fixed in
> FreeBSD too. After all, there isn't much of a point having a system
> call called 'nanosleep' whose time resolution is coarse-grained and
> non-deterministic from computer to computer (based on how hz was
> configured).
>

> Since you seem to be depending on fine-resolution timers more and
> more in recent kernels, you should consider porting our SYSTIMER API
> to virtualize one-shot and periodic-timers. Look at kern/kern_systimer.c
> in the DragonFly source. The code is fairly well abstracted, operates
> on a per-cpu basis, and even though you don't have generic IPI messaging
> I think you could port it without too much trouble.
>
> If you port it and start using it you will quickly find that you can't
> live without it. e.g. take a look at how we implement network POLLING for
> an example of its use. The polling rate can be set to anything at
> any time, regardless of 'hz'. Same goes for interrupt rate limiting,
> various scheduler timers, and a number of other things. All the things
> that should be divorced from 'hz' have been.
>

> For people worried about edge conditions due to multiple unsynchronized
> timers going off I will note that it's never been an issue for us, and
> in any case it's fairly trivial to adjust the systimer code to synchronize
> periodic time bases which run at integer multiples to timeout at the
> same time. Most periodic time bases tend to operate in this fashion
> (the stat clock being the only notable exception) so full efficiency
> can be retained. But, as I said, I've actually done that and not
> noticed any significant improvement in performance so I just don't bother
> right now.

Matt,

I've been testing network and routing performance over the past two weeks
with a calibrated Agilent N2X packet generator. My test box is a dual
Opteron 852 (2.6GHz) with Tyan S8228 mobo and Intel dual-GigE in PCI-X-133
slot. Note that I've run all tests with UP kernels em0->em1.

For stock FreeBSD-7-CURRENT from 28. Dec. 2005 I've got 580kpps with fast-
forward enabled. A em(4) patch from Scott Long implementing a taskqueue
raised this to 729kpps.

For stock DragonFlyBSD-1.4-RC1 I've got 327kpps and then it breaks down and
never ever passes a packet again until a down/up on the receiving interface.
net.inet.ip.intr_queue_maxlen has to be set to 200, otherwise it breaks down
at 252kpps already. Enabling polling did not make a difference and I've tried
various settings and combinations without any apparent effect on performance
(burst=1000, each_burst=50, user_frac=1, pollhz=5000).

What surprised me most, apart from the generally poor performance, is the sharp
dropoff after max pps and the wedging of the interface. I didn't see this kind
of behaviour on any other OS I've tested (FreeBSD and OpenBSD).

--
Andre

Matthew Dillon

Jan 7, 2006, 7:42:04 PM

:Matt,

:
:I've been testing network and routing performance over the past two weeks
:with a calibrated Agilent N2X packet generator. My test box is a dual
:Opteron 852 (2.6GHz) with Tyan S8228 mobo and Intel dual-GigE in PCI-X-133
:slot. Note that I've run all tests with UP kernels em0->em1.
:
:For stock FreeBSD-7-CURRENT from 28. Dec. 2005 I've got 580kpps with fast-
:forward enabled. A em(4) patch from Scott Long implementing a taskqueue
:raised this to 729kpps.
:
:For stock DragonFlyBSD-1.4-RC1 I've got 327kpps and then it breaks down and
:never ever passes a packet again until a down/up on the receiving interface.
:net.inet.ip.intr_queue_maxlen has to be set to 200, otherwise it breaks down
:at 252kpps already. Enabling polling did not make a difference and I've tried
:various settings and combinations without any apparent effect on performance
:(burst=1000, each_burst=50, user_frac=1, pollhz=5000).
:
:What surprised me most, apart from the generally poor performance, is the sharp
:dropoff after max pps and the wedging of the interface. I didn't see this kind
:of behaviour on any other OS I've tested (FreeBSD and OpenBSD).
:
:--
:Andre

Well, considering that we haven't removed the MP lock from the network
code yet, I'm not surprised at the poorer performance. The priority has
been on getting the algorithms in, correct, and stable, proving their
potential, but not hacking things up to eke out maximum performance
before its time. At the moment there is a great deal of work slated for
1.5 to properly address many of the issues.

Remember that the difference between 327kpps and 792kpps is the difference
between 3 uS and 1.2 uS per packet of overhead. That isn't all that
huge a difference, really, especially considering that everything is
serialized down to effectively 1 cpu due to the MP lock.
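
The conversion from a forwarding rate to time per packet is just a reciprocal. A minimal sketch, with an illustrative function name:

```c
/* Time spent per packet, in microseconds, at a forwarding rate given
 * in thousands of packets per second (kpps).
 * 1 / (kpps * 1000 packets/s) seconds = 1000 / kpps microseconds. */
double per_packet_us(double kpps)
{
    return 1000.0 / kpps;
}
```

per_packet_us(327) is about 3.06 and per_packet_us(792) is about 1.26, matching the rounded figures above. The 729 kpps that Andre reported works out to about 1.37 uS.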

:For stock FreeBSD-7-CURRENT from 28. Dec. 2005 I've got 580kpps with fast-

:forward enabled. A em(4) patch from Scott Long implementing a taskqueue
:raised this to 729kpps.

The single biggest overhead we have right now is that we have not
yet embedded a LWKT message structure in the mbuf. That means we
are currently malloc() and free()ing a message structure for every
packet, costing at least 700 nS in additional overhead and possibly
more if a cross-cpu free is needed (even with the passive IPIQ the
free() code does in that case). This problem is going to be fixed once
1.4 is released, but in order to do it properly I intend to completely
separate the mbuf data vs header concept... give them totally different
structural names instead of overloading them with a union, then embedding
the LWKT message structure in the mbuf_pkt.
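
The general shape of that fix can be sketched like this. The struct and function names below are hypothetical stand-ins, not the real DragonFly mbuf or LWKT definitions: the point is only that a message header embedded in the packet structure needs no separate allocation, and that code handed the header can still find the packet that contains it.

```c
#include <stddef.h>

/* Hypothetical message header (stand-in for an LWKT message). */
struct msg_hdr {
    int cmd;
};

/* Hypothetical packet header with the message embedded in it, so no
 * malloc()/free() of a separate message is needed per packet. */
struct pkt {
    struct msg_hdr msg;
    size_t len;
    unsigned char data[2048];
};

/* Recover the packet that contains a given embedded message header. */
struct pkt *pkt_from_msg(struct msg_hdr *m)
{
    return (struct pkt *)((char *)m - offsetof(struct pkt, msg));
}
```

The message then lives and dies with the packet, which is what removes the per-packet allocation cost from the path.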

Another example would be our IP forwarding code. Hahahah. I'm amazed
that it only takes 3 uS considering that it is running under both the
MP lock *AND* the new mutex-like serializer locks that will be replacing
the MP lock in the network subsystem AND hacking up those locks (so there
are four serializer locking operations per packet plus the MP lock).

The interrupt routing code has similar issues. The code is designed to
be per-cpu and tested in that context (by testing driver entry from other
cpus), but all hardware interrupts are still being taken on cpu #0, and
all polling is issued on cpu #0. This adds considerable overhead,
though it is mitigated somewhat by packet aggregation.

There are two or three other non-algorithmic issues of that nature in
the current network path that exist to allow the old algorithms to be
migrated to the new ones and which are slowly being cleaned up. I'm not
at all surprised that all of these shims cost us 1.8 uS in overhead.
I've run end-to-end timing tests for a number of operations, which you
can see from my BayLisa slides here:

http://www.dragonflybsd.org/docs/LISA200512/

What I have found is that the algorithms are sound and the extra overheads
are basically just due to the migratory hacks (like the malloc).
Those tests also tested that our algorithms are capable of pipelining
(MP safe wise) between the network interrupt and TCP or UDP protocol
stacks, and they can with only about 40 ns of IPI messaging overhead.
There are sysctls for testing the MP safe interrupt path, but they aren't
production ready yet (because they aren't totally MP safe due to the
route table, IP filter, and mbuf stats which are the only remaining
items that need to be made MP safe).

Frankly, I'm not really all that concerned about any of this. Certainly
not raw routing overhead (someone explain to me why you don't simply buy
a cisco, or write a custom driver if you really need to pop packets
between interfaces at 1 megapps instead of trying to use a piece of
generic code in a generic operating system to do it). Our focus is
frankly never going to be on raw packet switching because there is no
real-life situation where you would actually need to switch such a high
packet rate where you wouldn't also have the budget to simply buy an
off-the-shelf solution.

Our focus vis-a-vis the network stack is going to be on terminus
communications, meaning UDP and TCP services terminated or sourced on
the machine. All the algorithms have been proved out, the only thing
preventing me from flipping the MP lock off are the aforementioned
mbuf stats, route table, and packet filter code. In fact, Jeff *has*
turned off the MP lock for the TCP protocol threads for testing purposes,
with very good results. The route table is going to be fixed this month
when we get Jeff's MPSAFE parallel route table code into the tree. The
mbuf stats are a non-problem, really, just some minor work. The packet
filter(s) are more of an issue.

The numbers I ran for the BayLisa talk show our network interrupt overhead
is around 1-1.5 uS per packet, and our TCP overhead is around
1-1.5 uS per packet. 700 ns of that is the aforementioned malloc/free
issue, and a good chunk of the remaining overhead is MP lock related.

:For stock FreeBSD-7-CURRENT from 28. Dec. 2005 I've got 580kpps with fast-

:forward enabled. A em(4) patch from Scott Long implementing a taskqueue
:raised this to 729kpps.

An interface lockup is a different matter. Nothing can be said about
that until the cause of the problem is tracked down. I can't speculate
as to the problem without more information.

-Matt
Matthew Dillon
<dil...@backplane.com>

Scott Long

Jan 7, 2006, 8:25:11 PM

Matthew Dillon wrote:
[...]

I'm about to release a patch to Andre that should allow if_em to fast
forward 1mpps or more on his hardware, using no shortcuts or hacks other
than the inherent shortcut that the ffwd code provides. The approach
I'm taking also works on the other high performance network interfaces.
There is also a lot of work going on to streamline the ifnet layer that
will likely result in several hundred nanoseconds of latency being
removed from there. I'd personally love to see DragonFly approach this
level of performance. Given that it took FreeBSD about 3-4 years to
slog through setting up and validating a new architecture before we
could start focusing on performance, I think that DFly is right on track
on the same schedule. Hopefully the results are as worthwhile on DFly
in the future as they are on FreeBSD right now.

Scott

Matthew Dillon

Jan 7, 2006, 8:41:59 PM

:

:Matthew Dillon wrote:
:[...]
:
:I'm about to release a patch to Andre that should allow if_em to fast
:forward 1mpps or more on his hardware, using no shortcuts or hacks other
:than the inherent shortcut that the ffwd code provides. The approach
:I'm taking also works on the other high performance network interfaces.
:There is also a lot of work going on to streamline the ifnet layer that
:will likely result in several hundred nanoseconds of latency being
:removed from there. I'd personally love to see DragonFly approach this
:level of performance. Given that it took FreeBSD about 3-4 years to
:slog through setting up and validating a new architecture before we
:could start focusing on performance, I think that DFly is right on track
:on the same schedule. Hopefully the results are as worthwhile on DFly
:in the future as they are on FreeBSD right now.
:
:Scott

I think it's very possible. We have taken pains to retain the
fast-forwarding architecture (i.e. direct lookups, no context switches,
handle everything in the interrupt) and to greatly reduce mbuf allocation
overheads (e.g. by using Jeff's objcache infrastructure, which is an
algorithmic port of Sun's objcache infrastructure).

There are three areas of interest for us in this architecture:

(1) Route table lookups. This is basically a non-problem because Jeff
already has a per-cpu route table replication patch that will allow
us to do route table lookups without having to obtain or release
any locks or perform any bus locked instructions.

(2) Per-interface serializer (mutex). Right now the core packet
processing loop must release the originating interface serializer
and obtain the target interface serializer to forward a packet,
then release the target and re-obtain the originating. Clearly
this can be cleaned up by aggregating packets processed by the
originating interface and only doing the swap-a-roo once for N
packets.

The current code is a migratory 'hack' until the whole network
subsystem can be moved to the new network interface serializer.
Right now only the network interrupt subsystem has been moved to
the new serializer, so networking code is holding both the MP lock
AND the serializer.

(3) The IP filter / firewall code is the last big item we are going to
have a problem with. I intend to remove at least one of the packet
filters we support and do per-cpu replication for the remainder.
It turns out that most of the packet filter can be replicated, even
dynamically generated rules and queues and such. But it's a lot of
work.
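
The saving from the aggregation described in (2) can be counted with a small model. It assumes the release/obtain/release/re-obtain sequence costs four serializer operations and is performed once per batch. The function is illustrative only, not DragonFly code.

```c
/* Serializer operations needed to forward `npkts` packets when the
 * release-origin / obtain-target / release-target / re-obtain-origin
 * sequence (4 operations) is done once per batch of up to `batch`
 * packets.  batch == 1 models the per-packet behaviour. */
unsigned long serializer_ops(unsigned long npkts, unsigned long batch)
{
    unsigned long nbatches = (npkts + batch - 1) / batch;  /* ceiling */

    return 4 * nbatches;
}
```

Forwarding 1000 packets one at a time costs 4000 operations in this model, while batches of 32 cost 128.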

-Matt
Matthew Dillon
<dil...@backplane.com>

Andre Oppermann

Jan 8, 2006, 11:18:10 AM

This was using the UP kernel. No SMP, only one CPU. The CPU was not maxed
out as shown by top. There must be something else that is killing performance
on DragonFlyBSD.

--
Andre