Linux 2.6.29-rc6

Linus Torvalds

Feb 22, 2009, 11:31:31 PM

to Linux Kernel Mailing List

This is mostly lots of small fixes, with the stats being dominated by some
DocBook movement and an ia64 defconfig addition:

20.4% Documentation/DocBook/
3.9% Documentation/
2.0% arch/arm/
30.2% arch/ia64/configs/
5.5% arch/x86/
2.4% arch/
3.8% drivers/gpu/drm/i915/
2.3% drivers/scsi/
12.6% drivers/
2.2% fs/btrfs/
5.5% fs/cifs/
2.3% fs/

(the above is the "non-cumulative" dirstat, which doesn't add up
subdirectories cumulatively, and thus highlights individual directories
that contain changes, rather than the top-level directories).
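
As a rough illustration of the difference between the two modes, here is a
minimal Python sketch. It is not git's implementation: the file names and
line counts are made up, and the 2% cut-off is an assumption chosen to
resemble the listing above. In the non-cumulative mode a directory that is
reported keeps its share to itself, so a parent only shows what is left over.

```python
from collections import defaultdict

# Hypothetical per-file changed-line counts (not taken from the real diff).
SAMPLE_CHANGES = {
    "Documentation/DocBook/a.tmpl": 20,
    "Documentation/b.txt": 3,
    "Documentation/cgroups/c.txt": 1,
    "arch/ia64/configs/d_defconfig": 30,
    "arch/x86/kernel/e.c": 6,
    "drivers/scsi/f.c": 3,
    "drivers/mmc/g.c": 1,
    "drivers/char/h.c": 1,
    "fs/cifs/i.c": 5,
    "kernel/j.c": 30,
}


def _ancestors(directory):
    """Yield the parent directories of 'a/b/c/' as 'a/' and 'a/b/'."""
    parts = directory.rstrip("/").split("/")
    for depth in range(1, len(parts)):
        yield "/".join(parts[:depth]) + "/"


def dirstat(changes, threshold=2.0, cumulative=False):
    """Return {directory: percent of all changed lines} for a simplified model."""
    total = sum(changes.values())

    # Lines changed in each directory, including all of its subdirectories.
    subtree = defaultdict(int)
    for path, lines in changes.items():
        if "/" not in path:
            continue  # top-level files belong to no directory
        directory = path.rsplit("/", 1)[0] + "/"
        subtree[directory] += lines
        for parent in _ancestors(directory):
            subtree[parent] += lines

    claimed = defaultdict(int)  # lines already reported by a subdirectory
    report = {}
    # Visit the deepest directories first so children report before parents.
    for directory in sorted(subtree, key=lambda d: d.count("/"), reverse=True):
        amount = subtree[directory]
        if not cumulative:
            amount -= claimed[directory]
        percent = 100.0 * amount / total
        if percent >= threshold:
            report[directory] = percent
            if not cumulative:
                for parent in _ancestors(directory):
                    claimed[parent] += amount
    return report


if __name__ == "__main__":
    for mode in (False, True):
        print("cumulative" if mode else "non-cumulative")
        for directory, percent in sorted(dirstat(SAMPLE_CHANGES, cumulative=mode).items()):
            print(f"  {percent:5.1f}% {directory}")
```

With the sample data, the non-cumulative report drops arch/ entirely because
its subdirectories already account for all of its changes, while the
cumulative report lists arch/ with the sum of everything beneath it.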

But most of the changes are really pretty small, and the shortlog gives a
feel for it. About 350 files changed, averaging roughly 20 lines of
changes per file - but the average is somewhat misleading, because most
changes are just a couple of lines, and then the "big" changes are about
moving a few hundred lines of documentation or the 1601 lines of
defconfig.
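
To see why the average misleads, here is a small Python sketch with a
made-up distribution. Only the file count of about 350 and the 1601-line
defconfig come from the text above; the nine 500-line documentation moves
and the 4-line small changes are assumptions picked to land near the quoted
average.

```python
from statistics import mean, median

# Hypothetical per-file change sizes: one defconfig, a few documentation
# moves, and many small fixes. Only 1601 and the count of 350 are from the
# announcement; everything else is illustrative.
per_file_changes = [1601] + [500] * 9 + [4] * 340


def summarize(sizes):
    """Return basic statistics for a list of per-file change sizes."""
    return {
        "files": len(sizes),
        "total": sum(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
    }


if __name__ == "__main__":
    for key, value in summarize(per_file_changes).items():
        print(f"{key}: {value}")
```

In this made-up case ten large files pull the mean up to about 20 lines per
file, while the median, which describes the typical change, stays at the
size of a small fix.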

Regressions fixed, small cleanups, and some changes to help future
merging.

Linus

---
Adam Baker (1):
V4L/DVB (10619): gspca - main: Destroy the URBs at disconnection time.

Adam Lackorzynski (1):
jsm: additional device support

Al Viro (1):
Fix incomplete __mntput locking

Alan Jenkins (1):
PM/hibernate: fix "swap breaks after hibernation failures"

Alex Chiang (3):
PCI: Documentation: fix minor PCIe HOWTO thinko
[IA64] Revert "prevent ia64 from invoking irq handlers on offline CPUs"
[IA64] Remove redundant cpu_clear() in __cpu_disable path

Alexey Dobriyan (3):
kbuild: fix tags generation of config symbols
mfd: fix sm501 section mismatches
eeepc: should depend on INPUT

Alexey Starikovskiy (1):
ACPI: EC: Add delay for slow MSI controller

Alok N Kataria (1):
x86, vmi: TSC going backwards check in vmi clocksource

Andi Kleen (4):
kbuild: create the source symlink earlier in the objdir
x86, mce: reinitialize per cpu features on resume
x86, mce: use force_sig_info to kill process in machine check
x86, mce: fix ifdef for 64bit thermal apic vector clear on shutdown

Andrew Vasquez (3):
[SCSI] qla2xxx: Properly acknowledge IDC notification messages.
[SCSI] qla2xxx: Mask out 'reserved' bits while processing FLT regions.
[SCSI] qla2xxx: Update version number to 8.03.00-k3.

Andrew Victor (2):
[ARM] 5390/1: AT91: Watchdog fixes
[ARM] 5391/1: AT91: Enable GPIO clocks earlier

Andrey Borzenkov (1):
PM: Fix pm_notifiers during user mode hibernation

Aneesh Kumar K.V (3):
ext4: Fix lockdep warning
ext4: Initialize preallocation list_head's properly
ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages

Anirban Chakraborty (2):
[SCSI] qla2xxx: Remove interrupt request bit check in the response processing path in multiq mode.
[SCSI] qla2xxx: Correct slab-error overwrite during vport creation and deletion.

Anssi Hannula (1):
HID: move tmff and zpff devices from ignore_list to blacklist

Arjan van de Ven (4):
scripts: add x86 register parser to markup_oops.pl
scripts: add x86 64 bit support to the markup_oops.pl script
Consolidate driver_probe_done() loops into one place
PM/resume: wait for device probing to finish

Arve Hjønnevåg (2):
PM: Wait for console in resume
PM: Fix suspend_console and resume_console to use only one semaphore

Atsushi Nemoto (1):
atmel_serial might lose modem status change

Avi Kivity (2):
KVM: Avoid using CONFIG_ in userspace visible headers
KVM: VMX: Flush volatile msrs before emulating rdmsr

Benjamin Herrenschmidt (1):
vmalloc: add __get_vm_area_caller()

Bernhard Walle (1):
Bernhard has moved

Bill Nottingham (1):
vt: Declare PIO_CMAP/GIO_CMAP as compatbile ioctls.

Bjorn Helgaas (1):
ACPI: remove CONFIG_ACPI_SYSTEM

Boaz Harrosh (1):
bsg: Fix sense buffer bug in SG_IO

Brian King (3):
[SCSI] ibmvfc: Fix command timeout errors
[SCSI] ibmvfc: Fix rport relogin
[SCSI] ibmvfc: Increase cancel timeout

Chip Coldwell (1):
cciss: PCI power management reset for kexec

Chris Ball (1):
x86, olpc: fix model detection without OFW

Chris Mason (5):
Btrfs: process mount options on mount -o remount,
Btrfs: use larger metadata clusters in ssd mode
Btrfs: don't clean old snapshots on sync(1)
Btrfs: make a lockdep class for the extent buffer locks
Btrfs: check file pointer in btrfs_sync_file

Chris Wilson (16):
drm: Potential use-after-free on error path.
drm: Free the object ref on error.
drm/i915: Cleanup trivial leak on execbuffer error path.
drm/i915: hold mutex for unreference() in i915_gem_tiling.c
drm/i915: refleak along pin() error path.
drm: Do not leak a new reference for flink() on an existing name
drm/i915: Set framebuffer alignment based upon the fence constraints.
drm/i915: Release and unlock on mmap_gtt error path.
drm/i915: unpin for an invalid memory domain.
drm/i915: Unpin the ringbuffer if we fail to ioremap it.
drm/i915: Unpin the hws if we fail to kmap.
drm/i915: Unpin the fb on error during construction.
drm/i915: Cleanup the hws on ringbuffer constrution failure.
drm: Check for a NULL encoder when reverting on error path
drm: Propagate failure from setting crtc base.
drm/i915: Fix regression in 95ca9d

Christian Borntraeger (1):
[S390] Fix timeval regression on s390

Clemens Ladisch (2):
sound: usb-audio: fix uninitialized variable with M-Audio MIDI interfaces
sound: virtuoso: revert "do not overwrite EEPROM on Xonar D2/D2X"

Dan Carpenter (3):
ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling
HID: unlock properly on error paths in hidraw_ioctl()
sx.c: avoid referencing freed memory if copy_from_user() fails

Dan Williams (1):
atmel-mci: fix initialization of dma slave data

Dave Hansen (1):
powerpc/mm: Fix numa reserve bootmem page selection

David Brownell (2):
omap_hsmmc: card detect irq bugfix
omap_hsmmc: only MMC1 allows HCTL.SDVS != 1.8V

David Howells (1):
mn10300: fix oprofile

David Vrabel (1):
wusb: whci-hcd: always lock whc->lock with interrupts disabled

David Woodhouse (2):
iommu: fix Intel IOMMU write-buffer flushing
Fix Intel IOMMU write-buffer flushing

Davide Libenzi (1):
timerfd: add flags check

Ed L. Cashin (1):
aoe: ignore vendor extension AoE responses

Eric Anholt (3):
drm/i915: Cut two args to set_to_gpu_domain that confused this tricky path.
drm/i915: Don't let a device flush to prepare buffers clear new write_domains.
drm/i915: Retire requests from i915_gem_busy_ioctl.

Eric Biederman (1):
seq_file: properly cope with pread

Felix Blyakher (2):
Revert "[XFS] use scalable vmap API"
Revert "[XFS] remove old vmap cache"

Frank Seidel (1):
MAINTAINERS: Switch hdaps to Frank Seidel

Frederic Weisbecker (1):
tracing/function-graph-tracer: trace the idle tasks

Geert Uytterhoeven (1):
m68k: atari - Rename "mfp" to "st_mfp"

Geoff Levand (1):
powerpc/ps3: Move ps3_mm_add_memory to device_initcall

Giuseppe Bilotta (2):
lis3lv02d: support both one- and two-byte sensors
lis3lv02d: add axes knowledge of HP Pavilion dv5 models

Gregory CLEMENT (1):
[ARM] 5400/1: Add support for inverted rdy_busy pin for Atmel nand device controller

H. Peter Anvin (1):
x86, mce: remove incorrect __cpuinit for mce_cpu_features()

Hannes Reinecke (1):
block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list

Hans Verkuil (2):
V4L/DVB (10625): ivtv: fix decoder crash regression
V4L/DVB (10626): ivtv: fix regression in get sliced vbi format

Hans de Goede (1):
hwmon: Fix ACPI resource check error handling

Hartley Sweeten (1):
[ARM] 5405/1: ep93xx: remove unused gesbc9312.h header

Heiko Carstens (1):
[S390] fix "mem=" handling in case of standby memory

Helmut Schaa (1):
sdhci: fix led naming

Herbert Xu (1):
crypto: lrw - Fix big endian support

Igor Mammedov (1):
[CIFS] Prevent OOPs when mounting with remote prefixpath.

Ilpo Järvinen (1):
sx.c: fix dbl statement if - add missing braces

Ingo Molnar (4):
sched: cpu hotplug fix
inotify: fix GFP_KERNEL related deadlock
x86: use the right protections for split-up pagetables
PM: Split up sysdev_[suspend|resume] from device_power_[down|up], fix

Isaku Yamahata (1):
[IA64] fixes configs and add default config for ia64 xen domU

James Smart (1):
[SCSI] scsi_scan: add missing interim SDEV_DEL state if slave_alloc fails

Jan Kara (3):
jbd2: Fix return value of jbd2_journal_start_commit()
Revert "ext4: wait on all pending commits in ext4_sync_fs()"
jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()

Jean Delvare (2):
mfd: terminate pcf50633 i2c_device_id list
hwmon: (f71882fg) Hide misleading error message

Jean Pihet (2):
omap_hsmmc: recover from transfer failures
omap_hsmmc: Change while(); loops with finite version

Jeff Layton (3):
cifs: refactor new_inode() calls and inode initialization
cifs: properly handle case where CIFSGetSrvInodeNumber fails
cifs: posix fill in inode needed by posix open

Jeff Mahoney (2):
Btrfs: balance_level checks !child after access
Btrfs: remove btrfs_init_path

Jens Axboe (2):
block: fix bad definition of BIO_RW_SYNC
block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb

Jeremy Fitzhardinge (2):
x86/cpa: make sure cpa is safe to call in lazy mmu mode
x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption

Jesse Barnes (4):
drm/i915: take struct mutex around fb unref
drm/i915: Keep refs on the object over the lifetime of vmas for GTT mmap.
drm/i915: suspend/resume GEM when KMS is active
drm/i915: fix WC mapping in non-GEM i915 code.

Jiri Slaby (3):
HID: fix bus endianity in file2alias
x86_64: acpi/wakeup_64 cleanup
x86_64: Fix S3 fail path

Johannes Weiner (3):
slab: introduce kzfree()
swsusp: dont fiddle with swappiness
swsusp: clean up shrink_all_zones()

John Stultz (1):
x86, hpet: fix for LS21 + HPET = boot hang

Joris van Rantwijk (1):
ALSA: usb-audio - Workaround for misdetected sample rate with CM6207

Josef Bacik (1):
Btrfs: make sure all pending extent operations are complete

Josh Hunt (1):
kbuild: add vmlinux to kernel rpm

Julia Lawall (3):
[SCSI] lpfc: introduce missing kfree
Btrfs: fs/btrfs/volumes.c: remove useless kzalloc
mfd: Fix egpio kzalloc return test

KAMEZAWA Hiroyuki (2):
mm: clean up for early_pfn_to_nid()
mm: fix memmap init for handling memory hole

Kristian Høgsberg (5):
drm: Release user fbs in drm_release
drm: Add locking around cursor gem operations.
drm: Bring PLL limits in sync with DDX values.
drm: Collapse identical i8xx_clock() and i9xx_clock().
drm: Use spread spectrum when the bios tells us it's ok.

Krzysztof Helt (1):
fbdev/drm: fix Kconfig submenu mess in "Graphics support"

Li Zefan (4):
cgroups: update documentation about css_set hash table
cgroups: fix possible use after free
README: fix a wrong filename
cpuset: various documentation fixes and updates

Linus Torvalds (2):
x86: Add IRQF_TIMER to legacy x86 timer interrupt descriptors
Linux 2.6.29-rc6

Luca Bigliardi (1):
uml: fix vde network backend in user mode linux

Makito SHIOKAWA (1):
[ARM] 5404/1: Fix condition in arm_elf_read_implies_exec() to set READ_IMPLIES_EXEC

Marcelo Tosatti (4):
KVM: mmu_notifiers release method
KVM: PIT: fix i8254 pending count read
KVM: x86: disable kvmclock on non constant TSC hosts
KVM: x86: fix LAPIC pending count calculation

Mark Brown (5):
mfd: Initialise WM8350 interrupts earlier
mfd: Improve diagnostics for WM8350 ID register probe
mfd: Mark WM835x USB_SLV_500MA bit as accessible
mfd: Fix TWL4030 build on some ARM variants
mfd: Ensure all WM8350 IRQs are masked at startup

Mark McLoughlin (1):
KVM: Fix assigned devices circular locking dependency

Markus Metzger (1):
x86, ptrace, mm: fix double-free on race

Martin Peschke (1):
[SCSI] sg: fix device number in blktrace data

Matthew Wilcox (1):
PCI/MSI: fix msi_mask() shift fix

Mauro Carvalho Chehab (3):
V4L/DVB (10527): tuner: fix TUV1236D analog/digital setup
V4L/DVB (10572): Revert commit dda06a8e4610757def753ee3a541a0b1a1feb36b
8250: fix boot hang with serial console when using with Serial Over Lan port

Michael Buesch (2):
spi-gpio: sanitize MISO bitvalue
spi_bitbang: add more lowlevel function documentation

Michael Neuling (2):
powerpc/vsx: Fix VSX alignment handler for regs 32-63
bootgraph: fix for use with dot symbols

Michael Tokarev (1):
HID: blacklist Powercom USB UPS

Mike Christie (1):
[SCSI] libiscsi: Fix scsi command timeout oops in iscsi_eh_timed_out

Mike Frysinger (1):
kbuild,setlocalversion: shorten the make time when using svn

Mike Murphy (2):
PATCH [1/2] Documentation/driver-model/device.txt: fix struct device_attribute
PATCH [2/2] Documentation/filesystems/sysfs.txt: fix descriptions of device attributes

Neil Brown (1):
block: fix booting from partitioned md array

Nick Piggin (1):
mm: task dirty accounting fix

Nicolas Pitre (2):
[ARM] 5401/1: Orion: fix edge triggered GPIO interrupt support
[ARM] 5402/1: fix a case of wrap-around in sanity_check_meminfo()

Paul E. McKenney (1):
x86, rcu: fix strange load average and ksoftirqd behavior

Paul Moore (2):
cipso: Fix documentation comment
selinux: Fix the NetLabel glue code for setsockopt()

Paul Turner (1):
vfs: separate FMODE_PREAD/FMODE_PWRITE into separate flags

Pavel Machek (2):
Pavel has moved
hp accelerometer: add freefall detection

Pekka Paalanen (3):
mmiotrace: count events lost due to not recording
trace: mmiotrace to the tracer menu in Kconfig
doc: mmiotrace.txt, buffer size control change

Peter Oberparleiter (1):
[S390] sclp: handle empty event buffers

Peter Zijlstra (3):
futex: fix reference leak
timers: more consistently use clock vs timer
fs/super.c: add lockdep annotation to s_umount

Philipp Zabel (1):
mfd: fix htc-egpio iomem resource handling using resource_size

Philippe De Muyter (1):
floppy: request and release only the ports we actually use

Philippe Gerum (1):
powerpc/mm: Fix _PAGE_CHG_MASK to protect _PAGE_SPECIAL

Pierre Ossman (1):
Revert "sdhci: force high speed capability on some controllers"

Pierre Willenbrock (1):
drm/i915: Add missing mutex_lock(&dev->struct_mutex)

Qinghuang Feng (1):
Btrfs: remove unused code in split_state()

Rabin Vincent (2):
kbuild: add sys_* entries for syscalls in tags
mmc_test: fix basic read test

Rafael J. Wysocki (4):
USB/PCI: Fix resume breakage of controllers behind cardbus bridges
pm: fix build for CONFIG_PM unset
PM: fix build for CONFIG_PM unset
PM: Split up sysdev_[suspend|resume] from device_power_[down|up]

Rakib Mullick (1):
mfd: Fix sm501_register_gpio section mismatch

Randy Dunlap (7):
PCI: fix rom.c kernel-doc warning
PCI: fix struct pci_platform_pm_ops kernel-doc
PCI: fix missing kernel-doc and typos
x86: dell-laptop: depends on POWER_SUPPLY
docsrc: use config instead of menuconfig
docbook: split kernel-api for device-drivers
acpi/doc: add missing param value

Richard Hughes (1):
battery: don't assume we are fully charged when not charging or discharging

Robert Jennings (1):
[SCSI] ibmvscsi: Correct DMA mapping leak

Robin Holt (1):
[IA64] bte_copy of BTE_MAX_XFER trips BUG_ON.

Roel Kluin (4):
mfd: wm8350 tries reaches -1
FRV: __pte_to_swp_entry doesn't expand correctly
paride/pg.c: xs(): &&/|| confusion
[ARM] 5403/1: pxa25x_ep_fifo_flush() *ep->reg_udccs always set to 0

Roland Dreier (1):
drm/i915: Fix potential AB-BA deadlock in i915_gem_execbuffer()

Russell King (3):
[ARM] omap: fix omap2_divisor_to_clksel() error return value
[ARM] omap: fix _omap2_clksel_get_src_field()
[ARM] omap: fix clock reparenting in omap2_clk_set_parent()

Rusty Russell (2):
cpumask: fix powernow-k8: partial revert of 2fdf66b491ac706657946442789ec644cc317e1a
cpumask: Use cpu_*_mask accessors code: alpha

Sergei Shtylyov (1):
libata-sff: fix 32-bit PIO ATAPI regression

Sheng Yang (4):
KVM: Add kvm_arch_sync_events to sync with asynchronize events
KVM: Fix racy in kvm_free_assigned_irq
KVM: MMU: Map device MMIO as UC in EPT
KVM: Fix INTx for device assignment

Shyam...@Dell.com (1):
[SCSI] qla2xxx: fix Kernel Panic with Qlogic 2472 Card.

Steve Aarnio (1):
drm/i915: Don't add panel_fixed_mode to the probed modes list at LVDS init.

Steve French (4):
[CIFS] ipv6_addr_equal for address comparison
[CIFS] Fix oops in cifs_strfromUCS_le mounting to servers which do not specify their OS
[CIFS] improve posix semantics of file create
[CIFS] Fix multiuser mounts so server does not invalidate earlier security contexts

Steven Rostedt (3):
tracing: disable tracing while testing ring buffer
tracing: have function trace select kallsyms
tracing: limit the number of loops the ring buffer self test can make

Subhash Peddamallu (1):
fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free

Suresh Siddha (1):
x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem

Takashi Iwai (3):
Revert "Sound: hda - Restore PCI configuration space with interrupts off"
ALSA: usb-audio - Fix non-continuous rate detection
ALSA: jack - Use card->shortname for input name

Tejun Heo (2):
sata_nv: give up hardreset on nf2
vmalloc: call flush_cache_vunmap() from unmap_kernel_range()

Thomas Gleixner (3):
x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
x86: CPA avoid repeated lazy mmu flush
x86, vm86: fix preemption bug

Tobias Klauser (1):
drm/i915: Storage class should be before const qualifier

Tobias Lorenz (2):
V4L/DVB (10532): Correction of Stereo detection/setting and signal strength indication
V4L/DVB (10533): fix LED status output

Tony Luck (2):
[IA64] Build fix for __early_pfn_to_nid() undefined link error
[IA64] xen_domu build fix

Tony Vroon (1):
fujitsu-laptop: Use RFKILL support bitmask from firmware

Trent Piepho (1):
V4L/DVB (10516a): zoran: Update MAINTAINERS entry

Wei Yongjun (2):
ext4: Fix to read empty directory blocks correctly in 64k
mn10300: fix typo && -> || in arch/mn10300/unit-asb2305/pci.c

Wim Van Sebroeck (1):
[WATCHDOG] iTCO_wdt: fix SMI_EN regression 2

Yan Zheng (2):
Btrfs: Avoid using __GFP_HIGHMEM with slab allocator
Btrfs: hold trans_mutex when using btrfs_record_root_in_trans

Yang Hongyang (1):
atyfb: remove unused local variable `pwr_command'

Yang Zhang (1):
KVM: ia64: fix fp fault/trap handler

Yauhen Kharuzhy (1):
s3cmci: Fix hangup in do_pio_write()

Yi Li (1):
MMC: fix bug - SDHC card capacity not correct

Zachary Amsden (1):
MAINTAINERS: paravirt-ops maintainers update

Zlatko Calusic (1):
Add support for VT6415 PCIE PATA IDE Host Controller

etienne (1):
drm/radeon: update sarea copies of last_ variables on resume.

wanzongshun (1):
[ARM] 5398/1: Add Wan ZongShun to MAINTAINERS for W90P910

Karsten Wiese

Feb 23, 2009, 9:08:24 AM

to Linus Torvalds, Eric Anholt, Linux Kernel Mailing List

Fix an oops in i915_gem_retire_requests()

dev_priv->hw_status_page can be NULL, if i915_gem_retire_requests()
is called from i915_gem_busy_ioctl().

Signed-off-by: Karsten Wiese <f...@wemgehoertderstaat.de>
---
drivers/gpu/drm/i915/i915_gem.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 25b3374..28b726d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1051,6 +1051,9 @@ i915_gem_retire_requests(struct drm_device *dev)
 	drm_i915_private_t *dev_priv = dev->dev_private;
 	uint32_t seqno;
 
+	if (!dev_priv->hw_status_page)
+		return;
+
 	seqno = i915_get_gem_seqno(dev);
 
 	while (!list_empty(&dev_priv->mm.request_list)) {
--
1.6.0.6

Jesper Krogh

Feb 26, 2009, 6:16:28 AM

to Linus Torvalds, linux-...@vger.kernel.org

Booting up 2.6.29-rc6 gave me this one in dmesg...

[ 21.136149] ck804xrom ck804xrom_init_one(): Unable to register
resource 0x00000000ff000000-0x00000000ffffffff - kernel bug?
[ 21.136258] resource map sanity check conflict: 0xff000000 0xffffffff
0xff700000 0xffffffff reserved
[ 21.136267] ------------[ cut here ]------------
[ 21.136269] WARNING: at arch/x86/mm/ioremap.c:208
__ioremap_caller+0x359/0x390()
[ 21.136271] Hardware name: Sun Fire X2200 M2 with Quad Core Processor
[ 21.136273] Info: mapping multiple BARs. Your kernel is fine.Modules
linked in: ck804xrom(+) mtd chipreg pcspkr(+) shpchp button pci_hotplug
i2c_nforce2 i2c_core map_funcs evdev ext3 jbd mbcache sg sd_mod usbhid
hid amd74xx sata_nv tg3 ata_generic libphy ehci_hcd libata ohci_hcd
forcedeth scsi_mod usbcore thermal processor fan thermal_sys fuse
[ 21.136289] Pid: 3843, comm: modprobe Not tainted 2.6.29-rc6 #2
[ 21.136291] Call Trace:
[ 21.136298] [<ffffffff8023d352>] warn_slowpath+0xf2/0x130
[ 21.136301] [<ffffffff8023d62a>] __call_console_drivers+0x6a/0x90
[ 21.136304] [<ffffffff8023e1fe>] printk+0x4e/0x60
[ 21.136306] [<ffffffff8023e1fe>] printk+0x4e/0x60
[ 21.136309] [<ffffffff8036b520>] match_pci_dev_by_id+0x0/0x60
[ 21.136313] [<ffffffff8024360e>] iomem_map_sanity_check+0xbe/0xd0
[ 21.136316] [<ffffffff80229799>] __ioremap_caller+0x359/0x390
[ 21.136320] [<ffffffffa01eb1f6>] init_ck804xrom+0x1f6/0x62c [ck804xrom]
[ 21.136322] [<ffffffffa01eb1f6>] init_ck804xrom+0x1f6/0x62c [ck804xrom]
[ 21.136326] [<ffffffff80275eac>] tracepoint_update_probe_range+0x1c/0xb0
[ 21.136329] [<ffffffffa01eb000>] init_ck804xrom+0x0/0x62c [ck804xrom]
[ 21.136332] [<ffffffff8020903b>] _stext+0x3b/0x160
[ 21.136335] [<ffffffff80359141>] __up_read+0x21/0xb0
[ 21.136340] [<ffffffff80256495>]
__blocking_notifier_call_chain+0x65/0x90
[ 21.136343] [<ffffffff80265604>] sys_init_module+0xb4/0x200
[ 21.136346] [<ffffffff8020c35b>] system_call_fastpath+0x16/0x1b
[ 21.136348] ---[ end trace f807e12658961c2d ]---

System is fully operational, but I didn't get it in 2.6.26.8 (most recent
kernel tried on this hardware).

--
Jesper

Marcin Slusarz

Feb 26, 2009, 12:17:53 PM

to Jesper Krogh, Linus Torvalds, linux-...@vger.kernel.org, Dave Olsen, Ryan Jackson, David.W...@intel.com, linu...@lists.infradead.org

This message comes from this code in drivers/mtd/maps/ck804xrom.c:
	/*
	 * Try to reserve the window mem region. If this fails then
	 * it is likely due to a fragment of the window being
	 * "reserved" by the BIOS. In the case that the
	 * request_mem_region() fails then once the rom size is
	 * discovered we will try to reserve the unreserved fragment.
	 */
	window->rsrc.name = MOD_NAME;
	window->rsrc.start = window->phys;
	window->rsrc.end = window->phys + window->size - 1;
	window->rsrc.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
	if (request_resource(&iomem_resource, &window->rsrc)) {
		window->rsrc.parent = NULL;
		printk(KERN_ERR MOD_NAME
		       " %s(): Unable to register resource"
		       " 0x%.016llx-0x%.016llx - kernel bug?\n",
		       __func__,
		       (unsigned long long)window->rsrc.start,
		       (unsigned long long)window->rsrc.end);
	}

So it's probably harmless.
Adding CC's.

Marcin

Linus Torvalds

Feb 26, 2009, 12:53:43 PM

to Jesper Krogh, David Woodhouse, Dave Olsen, Ryan Jackson, linu...@lists.infradead.org, Linux Kernel Mailing List

Dave Olsen <dol...@lnxi.com>,
Ryan Jackson <rjac...@lnxi.com>, David.W...@intel.com,
linu...@lists.infradead.org

On Thu, 26 Feb 2009, Jesper Krogh wrote:
>
>
> Booting up 2.6.29-rc6 gave me this one in dmesg...
>
> [ 21.136149] ck804xrom ck804xrom_init_one(): Unable to register resource 0x00000000ff000000-0x00000000ffffffff - kernel bug?

Well, it _is_ a kernel bug, but it's in that stupid driver. It does
everything wrong, including printing out a scary message.

Piece of sh*t driver, in other words.

I mean, it even has a _comment_ about how the request_region is likely to
not succeed, and then it prints out that scary message when it
then doesn't do so.

Not to mention that the driver is likely _wrong_ to just unconditionally
try to enable that resource without *first* checking whether the resource
can actually be enabled or whether there are other resources in that same
window.
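
The kind of check described above can be sketched in userspace Python. This
is only a model of the idea, not the kernel's request_resource() code: the
Resource type and both helper functions are hypothetical, and the single
"reserved" entry is taken from the sanity-check line in the report (the
full /proc/iomem contents are not shown in this thread).

```python
from typing import List, NamedTuple, Optional, Tuple


class Resource(NamedTuple):
    name: str
    start: int
    end: int  # inclusive, like the kernel's struct resource


def first_conflict(existing: List[Resource], new: Resource) -> Optional[Resource]:
    """Return the first existing resource that overlaps 'new', if any."""
    for res in existing:
        if res.start <= new.end and new.start <= res.end:
            return res
    return None


def free_fragments(existing: List[Resource], new: Resource) -> List[Tuple[int, int]]:
    """Return the parts of 'new' not covered by any existing resource."""
    fragments = []
    cursor = new.start
    for res in sorted(existing, key=lambda r: r.start):
        if res.end < cursor:
            continue  # entirely below the part still being examined
        if res.start > new.end:
            break  # entirely above the requested window
        if res.start > cursor:
            fragments.append((cursor, res.start - 1))
        cursor = max(cursor, res.end + 1)
    if cursor <= new.end:
        fragments.append((cursor, new.end))
    return fragments


# Ranges from the dmesg report: the driver's window and the reserved region.
IOMEM = [Resource("reserved", 0xFF700000, 0xFFFFFFFF)]
WINDOW = Resource("ck804xrom", 0xFF000000, 0xFFFFFFFF)

if __name__ == "__main__":
    conflict = first_conflict(IOMEM, WINDOW)
    if conflict is not None:
        print(f"window overlaps {conflict.name}: "
              f"{conflict.start:#010x}-{conflict.end:#010x}")
    for start, end in free_fragments(IOMEM, WINDOW):
        print(f"unreserved fragment: {start:#010x}-{end:#010x}")
```

Checking for a conflict first lets a caller decide what to do about the
overlap before touching the hardware, rather than finding out from a failed
registration afterwards.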

Quite frankly, I find that whole thing scary. The driver should be deleted
or at least marked EXPERIMENTAL or BROKEN.

It has a "BE VERY CAREFUL" in the Kconfig _help_ text, but is not marked
as being dangerous any other way.

That said, I really don't see why you would get this message _now_. The
total braindamage of that driver in no way seems new. Did you perhaps not
notice before, or did you just not enable it before?

> [ 21.136269] WARNING: at arch/x86/mm/ioremap.c:208 __ioremap_caller+0x359/0x390()

This is a different, but related warning, since the driver is doing an
ioremap across different resources. The warning is directly related to the
fact that the resource wasn't actually valid to begin with.

What does "cat /proc/iomem" say?

> System is fully operational, but I didn't get it in 2.6.26.8 (most recent
> kernel tried on this hardware).

The ioremap() warning is newish, and may be what made you notice the
previous (just one-line) crappy warning.

Quite frankly, having looked at that horrible driver, I would seriously
consider disabling it. Stuff like that should not be allowed to exist.

Linus

David Woodhouse

Feb 26, 2009, 2:22:26 PM

to Linus Torvalds, Jesper Krogh, Dave Olsen, Ryan Jackson, linu...@lists.infradead.org, Linux Kernel Mailing List

On Thu, 2009-02-26 at 17:53 +0000, Linus Torvalds wrote:
> Dave Olsen <dol...@lnxi.com>,
> Ryan Jackson <rjac...@lnxi.com>, David.W...@intel.com,
> linu...@lists.infradead.org
>
>
> On Thu, 26 Feb 2009, Jesper Krogh wrote:
> >
> >
> > Booting up 2.6.29-rc6 gave me this one in dmesg...
> >
> > [ 21.136149] ck804xrom ck804xrom_init_one(): Unable to register resource 0x00000000ff000000-0x00000000ffffffff - kernel bug?
>
> Well, it _is_ a kernel bug, but it's in that stupid driver. It does
> everything wrong, including printing out a scary message.
>
> Piece of sh*t driver, in other words.
>
> I mean, it even has a _comment_ about how the request_region is likely to
> not succeed, and then it prints out that scary message when it
> then doesn't do so.
>
> Not to mention that the driver is likely _wrong_ to just unconditionally
> try to enable that resource without *first* checking whether the resource
> can actually be enabled or whether there are other resources in that same
> window.
>
> Quite frankly, I find that whole thing scary. The driver should be deleted
> or at least marked EXPERIMENTAL or BROKEN.

It's giving you access to your BIOS flash so that you can overwrite it
from within Linux. It's _supposed_ to be scary :)

It's also always going to be a hack -- it's a PITA getting direct access
to that flash on most PeeCee chipsets. The driver operates on the
principle that it knows the hardware, and it can _make_ the flash appear
at the appropriate physical addresses. The theory, at least, is that it
knows better than the kernel does.

But yeah, it should probably at least look for other things which
already overlap with the region that it's trying to 'create'. Although
the comment leads me to believe that sometimes that's _expected_ and
shouldn't cause the driver to abort.

Dave, Ryan, are you still actively using this?

--
David Woodhouse Open Source Technology Centre
David.W...@intel.com Intel Corporation


Jesper Krogh

Feb 26, 2009, 2:32:09 PM

to Linus Torvalds, David Woodhouse, Dave Olsen, Ryan Jackson, linu...@lists.infradead.org, Linux Kernel Mailing List

Linus Torvalds wrote:
> Dave Olsen <dol...@lnxi.com>,
> Ryan Jackson <rjac...@lnxi.com>, David.W...@intel.com,
> linu...@lists.infradead.org
>
>
> On Thu, 26 Feb 2009, Jesper Krogh wrote:
>>
>> Booting up 2.6.29-rc6 gave me this one in dmesg...
>>
>> [ 21.136149] ck804xrom ck804xrom_init_one(): Unable to register resource 0x00000000ff000000-0x00000000ffffffff - kernel bug?
>
> Well, it _is_ a kernel bug, but it's in that stupid driver. It does
> everything wrong, including printing out a scary message.

I've seen that before.. (even reported it before). It just "slipped"
into the cut'n'paste. It was the following stuff that I intended to report.

>> [ 21.136269] WARNING: at arch/x86/mm/ioremap.c:208 __ioremap_caller+0x359/0x390()
>
> This is a different, but related warning, since the driver is doing an
> ioremap across different resources. The warning is directly related to the
> fact that the resource wasn't actually valid to begin with.
>
> What does "cat /proc/iomem" say?

http://krogh.cc/~jesper/iomem.txt

>> System is fully operational, but I didn't get it in 2.6.26.8 (most recent
>> kernel tried on this hardware).
>
> The ioremap() warning is newish, and may be what made you notice the
> previous (just one-line) crappy warning.
>
> Quite frankly, having looked at that horrible driver, I would seriously
> consider disabling it. Stuff like that should not be allowed to exist.

Being a "stupid" user, I pick the easy way to build a fresh kernel:
1) pick the distro .config
2) make oldconfig
3) Let the kernel load what it thinks it needs.
4) Report if I see any strange stuff (warnings / bugs / oops) or
misbehaviour.

So I don't know if I need that driver for anything vital. Should I care?
Or shouldn't it "just work"?

--
Jesper

David Woodhouse

Feb 26, 2009, 2:37:45 PM

to Jesper Krogh, Linus Torvalds, Dave Olsen, Ryan Jackson, linu...@lists.infradead.org, Linux Kernel Mailing List

On Thu, 2009-02-26 at 19:31 +0000, Jesper Krogh wrote:
> 1) pick the distro .config
> 2) make oldconfig

So it should have been a module, not built-in?

> 3) Let the kernel load what it thinks it needs.

That part at least ought to be disabled -- we don't let this driver
autoload, because unless you _know_ you need it, you don't need it.

It's for overwriting your BIOS.

--
David Woodhouse Open Source Technology Centre
David.W...@intel.com Intel Corporation

Jesper Krogh

Feb 26, 2009, 2:47:21 PM

to David Woodhouse, Linus Torvalds, Dave Olsen, Ryan Jackson, linu...@lists.infradead.org, Linux Kernel Mailing List

David Woodhouse wrote:
> On Thu, 2009-02-26 at 19:31 +0000, Jesper Krogh wrote:
>> 1) pick the distro .config
>> 2) make oldconfig
>
> So it should have been a module, not built-in?

It is a module.. and it somehow gets auto-loaded on my system. (not
listed in /etc/modules).

$ grep -i ck804xrom /boot/config-2.6.29-rc6
CONFIG_MTD_CK804XROM=m

Same in the distro .config
$ grep -i ck804xrom /boot/config-2.6.24-23-server
CONFIG_MTD_CK804XROM=m

>> 3) Let the kernel load what it thinks it needs.
>
> That part at least ought to be disabled -- we don't let this driver
> autoload, because unless you _know_ you need it, you don't need it.
>
> It's for overwriting your BIOS.

Oh. Thanks for your time... I'll just make sure to disable it from now on.

--
Jesper

David Woodhouse

Feb 26, 2009, 2:50:32 PM

to Jesper Krogh, Linus Torvalds, Dave Olsen, Ryan Jackson, linu...@lists.infradead.org, Linux Kernel Mailing List

On Thu, 2009-02-26 at 20:46 +0100, Jesper Krogh wrote:
> It is a module.. and it somehow gets auto-loaded on my system. (not
> listed in /etc/modules).

Oops, we should have disabled that, but it still has a
MODULE_DEVICE_TABLE(). I'll remove that, for a start...

--
David Woodhouse Open Source Technology Centre
David.W...@intel.com Intel Corporation

Jesper Krogh

Feb 26, 2009, 2:55:44 PM

to Linus Torvalds, Linux Kernel Mailing List

2.6.29-rc6 seems to have trouble running ntpd reliably under load. My
nagios system has just alerted me to drifting time on the upgraded machine.

Feb 26 19:09:25 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
Feb 26 19:10:31 quad12 ntpd[4901]: synchronized to 10.194.133.12, stratum 4
Feb 26 19:25:21 quad12 ntpd[4901]: time reset -0.915488 s
Feb 26 19:29:11 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
Feb 26 19:31:21 quad12 ntpd[4901]: synchronized to 10.194.133.13, stratum 4
Feb 26 19:34:37 quad12 ntpd[4901]: synchronized to 10.194.133.12, stratum 4
Feb 26 19:37:53 quad12 ntpd[4901]: synchronized to 10.194.133.13, stratum 4
Feb 26 19:46:27 quad12 ntpd[4901]: synchronized to 10.194.133.12, stratum 4
Feb 26 19:46:27 quad12 ntpd[4901]: time reset -0.961386 s
Feb 26 19:50:30 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
Feb 26 19:51:34 quad12 ntpd[4901]: synchronized to 10.194.133.12, stratum 4
Feb 26 20:01:55 quad12 ntpd[4901]: synchronized to 10.194.133.13, stratum 4
Feb 26 20:06:18 quad12 ntpd[4901]: time reset -0.979177 s
Feb 26 20:10:15 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
Feb 26 20:11:21 quad12 ntpd[4901]: synchronized to 10.194.133.13, stratum 4
Feb 26 20:14:52 quad12 ntpd[4901]: synchronized to 10.194.133.12, stratum 4
Feb 26 20:19:10 quad12 ntpd[4901]: synchronized to 10.194.133.13, stratum 4
Feb 26 20:26:00 quad12 ntpd[4901]: time reset -0.923268 s
Feb 26 20:30:01 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
Feb 26 20:30:30 quad12 ntpd[4901]: synchronized to 10.194.133.13, stratum 4
Feb 26 20:45:36 quad12 ntpd[4901]: time reset -0.919609 s
Feb 26 20:49:49 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13

2.6.26.8 doesn't have this problem.

The "current_clocksource" is the same on both systems.

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

--
Jesper
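
A rough way to read the log above, as a sketch only: each "time reset" is a step correction, so dividing the step size by the time since the previous step gives an apparent drift rate. This ignores any slewing ntpd applied between steps, so the result is a ballpark figure at best. The helper names are made up for illustration, and the year is assumed from the message date.

```python
from datetime import datetime

# "time reset" lines copied from the syslog excerpt above.
LOG = """\
Feb 26 19:25:21 quad12 ntpd[4901]: time reset -0.915488 s
Feb 26 19:46:27 quad12 ntpd[4901]: time reset -0.961386 s
Feb 26 20:06:18 quad12 ntpd[4901]: time reset -0.979177 s
Feb 26 20:26:00 quad12 ntpd[4901]: time reset -0.923268 s
Feb 26 20:45:36 quad12 ntpd[4901]: time reset -0.919609 s
"""

ASSUMED_YEAR = "2009"  # syslog lines carry no year; taken from the mail date


def parse_resets(text):
    """Return (timestamp, step_seconds) for every 'time reset' line."""
    resets = []
    for line in text.splitlines():
        if "time reset" not in line:
            continue
        fields = line.split()
        stamp = datetime.strptime(
            ASSUMED_YEAR + " " + " ".join(fields[:3]), "%Y %b %d %H:%M:%S"
        )
        step = float(fields[-2])  # the number just before the trailing "s"
        resets.append((stamp, step))
    return resets


def apparent_drift_ppm(resets):
    """Step size divided by the interval since the previous step, in ppm."""
    rates = []
    for (prev_time, _), (cur_time, cur_step) in zip(resets, resets[1:]):
        interval = (cur_time - prev_time).total_seconds()
        rates.append(abs(cur_step) / interval * 1e6)
    return rates


rates = apparent_drift_ppm(parse_resets(LOG))
print([round(r) for r in rates])
```

On this excerpt the steps arrive roughly every 20 minutes and are a little under one second each, which works out to somewhere in the high hundreds of ppm.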

Linus Torvalds

Feb 26, 2009, 3:32:31 PM

to Jesper Krogh, David Woodhouse, Dave Olsen, Ryan Jackson, linu...@lists.infradead.org, Linux Kernel Mailing List

On Thu, 26 Feb 2009, Jesper Krogh wrote:

> Linus Torvalds wrote:
> > On Thu, 26 Feb 2009, Jesper Krogh wrote:
> > >
> > > Booting up 2.6.29-rc6 gave me this one in dmesg...
> > >
> > > [ 21.136149] ck804xrom ck804xrom_init_one(): Unable to register resource
> > > 0x00000000ff000000-0x00000000ffffffff - kernel bug?
> >
> > Well, it _is_ a kernel bug, but it's in that stupid driver. It does
> > everything wrong, including printing out a scary message.
>
> I've seen that before.. (even reported it before). It just "slipped" into the
> cut'n'paste It was the following stuff that I intended to report.

Ok. They very much are related. The new warning is just that - a new
warning.

> > > [ 21.136269] WARNING: at arch/x86/mm/ioremap.c:208
> > > __ioremap_caller+0x359/0x390()
> >
> > This is a different, but related warning, since the driver is doing an
> > ioremap across different resources. The warning is directly related to the
> > fact that the resource wasn't actually valid to begin with.
> >
> > What does "cat /proc/iomem" say?
>
> http://krogh.cc/~jesper/iomem.txt

Ok, so the thing conflicts with

ff700000-ffffffff : reserved
  ff700000-ffffffff : pnp 00:0b

and that probably _is_ somehow related to the whole flash thing.

I guess the driver could use "insert_resource()" and the problem would go
away. Except I do think it should be marked very dangerous some way, so
that you can't even enable it unless you really really know you want to
(eg something like EXPERIMENTAL). Because I don't think this driver is
appropriate in any other case..

> Being a "stupid" user, I pick the easy way to build a fresh kernel: 1)
> pick the distro .config 2) make oldconfig 3) Let the kernel load what it
> think it needs. 4) Report if I see and strange stuff (warnings / bugs /
> oops) or misbehaviour.
>
> So I dont know if I need that driver for anything vital. Should I care?
> Or shouldn't it "just work"?

You definitely don't need it, and everything will work without it.

Linus
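
To make the conflict concrete, here is a small sketch that classifies how the range the driver asked for relates to the range already listed in /proc/iomem. It assumes, as I understand the resource API, that insert_resource() can place a new resource as the parent of existing ones it fully covers, while a plain request fails on any overlap. The function name `relation` is hypothetical.

```python
def relation(new_start, new_end, old_start, old_end):
    """Classify how inclusive range [new_start, new_end] relates to an existing one."""
    if new_end < old_start or new_start > old_end:
        return "disjoint"
    if new_start <= old_start and new_end >= old_end:
        return "contains"
    if new_start >= old_start and new_end <= old_end:
        return "inside"
    return "partial overlap"


requested = (0xFF000000, 0xFFFFFFFF)  # range from the ck804xrom message
reserved = (0xFF700000, 0xFFFFFFFF)   # "reserved" / "pnp 00:0b" in /proc/iomem

print(relation(*requested, *reserved))
```

The requested window fully covers the reserved one, which is the case where inserting the new resource above the existing entry would be expected to succeed.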

Jesper Krogh

Feb 26, 2009, 3:43:40 PM

to Linus Torvalds, Linux Kernel Mailing List

Linus Torvalds wrote:
>
> On Thu, 26 Feb 2009, Jesper Krogh wrote:

>> 2.6.26.8 doesnt have this problem.
>>
>> The "current_clocsource" is the same on both systems.
>>
>> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>> tsc
>

> What does the frequency calibrate to? It should be in the dmesg. Does it
> differ by a big amount?

Non-working:
$ dmesg | grep -i freq
[ 0.004007] Calibrating delay loop (skipped), value calculated using
timer frequency.. 4620.05 BogoMIPS (lpj=9240104)

2.6.26.8 doesn't have that information.

Carl-Daniel Hailfinger

Feb 26, 2009, 3:54:24 PM

to David Woodhouse, Jesper Krogh, Ryan Jackson, linu...@lists.infradead.org, Dave Olsen, Linus Torvalds, Linux Kernel Mailing List

On 26.02.2009 20:36, David Woodhouse wrote:
> It's for overwriting your BIOS.
>

There's a pure userspace replacement for it. That replacement is even
packaged in most distros. See http://www.coreboot.org/Flashrom .

Regards,
Carl-Daniel

--
http://www.hailfinger.org/

john stultz

Feb 26, 2009, 4:19:44 PM

to Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

On Thu, Feb 26, 2009 at 12:43 PM, Jesper Krogh <jes...@krogh.cc> wrote:
> Linus Torvalds wrote:
>>
>> On Thu, 26 Feb 2009, Jesper Krogh wrote:
>>>
>>> 2.6.26.8 doesnt have this problem.
>>>
>>> The "current_clocsource" is the same on both systems.
>>>
>>> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>>> tsc
>>
>> What does the frequency calibrate to? It should be in the dmesg. Does it
>> differ by a big amount?
>
> Non-working:
> $ dmesg | grep -i freq
> [ 0.004007] Calibrating delay loop (skipped), value calculated using
> timer frequency.. 4620.05 BogoMIPS (lpj=9240104)
>
> 2.6.26.8 doesn't have that information.

I'm surprised the clocksource watchdog isn't catching it.

What's the output from:
cat /sys/devices/system/clocksource/clocksource0/available_clocksource

Also mind sending the full dmesg for both kernels?

thanks
-john

Jesper Krogh

Feb 26, 2009, 4:35:59 PM

to john stultz, Linus Torvalds, Linux Kernel Mailing List

john stultz wrote:
> On Thu, Feb 26, 2009 at 12:43 PM, Jesper Krogh <jes...@krogh.cc> wrote:
>> Linus Torvalds wrote:
>>> On Thu, 26 Feb 2009, Jesper Krogh wrote:
>>>> 2.6.26.8 doesnt have this problem.
>>>>
>>>> The "current_clocsource" is the same on both systems.
>>>>
>>>> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>>>> tsc
>>> What does the frequency calibrate to? It should be in the dmesg. Does it
>>> differ by a big amount?
>> Non-working:
>> $ dmesg | grep -i freq
>> [ 0.004007] Calibrating delay loop (skipped), value calculated using
>> timer frequency.. 4620.05 BogoMIPS (lpj=9240104)
>>
>> 2.6.26.8 doesn't have that information.
>
> I'm surprised the clocksource watchdog isn't catching it.
>
> What's the output from:
> cat /sys/devices/system/clocksource/clocksource0/available_clocksource

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc acpi_pm jiffies

Same on both.

> Also mind sending the full dmesg for both kernels?

http://krogh.cc/~jesper/dmesg-2.6.29-rc6.txt
http://krogh.cc/~jesper/dmesg-2.6.26.8.txt

--
Jesper
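
For collecting the same two sysfs values from several machines, a minimal sketch could look like the following. The directory is the one used in the commands above; the function name and the `base` parameter are assumptions added so the code can be pointed at a copy of the files.

```python
from pathlib import Path

# Same directory as in the "cat" commands above.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=SYSFS_DIR):
    """Return (current, [available, ...]) as reported by the kernel."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Calling `read_clocksources()` on the machines in this thread would be expected to give `("tsc", ["tsc", "acpi_pm", "jiffies"])`, matching the output shown.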

john stultz

Feb 26, 2009, 4:47:38 PM

to Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List, Thomas Gleixner

On Thu, 2009-02-26 at 22:35 +0100, Jesper Krogh wrote:
> john stultz wrote:
> > On Thu, Feb 26, 2009 at 12:43 PM, Jesper Krogh <jes...@krogh.cc> wrote:
> >> Linus Torvalds wrote:
> >>> On Thu, 26 Feb 2009, Jesper Krogh wrote:
> >>>> 2.6.26.8 doesnt have this problem.
> >>>>
> >>>> The "current_clocsource" is the same on both systems.
> >>>>
> >>>> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> >>>> tsc
> >>> What does the frequency calibrate to? It should be in the dmesg. Does it
> >>> differ by a big amount?
> >> Non-working:
> >> $ dmesg | grep -i freq
> >> [ 0.004007] Calibrating delay loop (skipped), value calculated using
> >> timer frequency.. 4620.05 BogoMIPS (lpj=9240104)
> >>
> >> 2.6.26.8 doesn't have that information.
> >
> > I'm surprised the clocksource watchdog isn't catching it.
> >
> > What's the output from:
> > cat /sys/devices/system/clocksource/clocksource0/available_clocksource
>
> $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
> tsc acpi_pm jiffies

Hmm. Does booting w/ "clocksource=acpi_pm" also show the severe (~550ppm,
which NTP can't handle) drift?

From the dmesg, I don't see any major calibration difference right off.

So I'd suspect something like TSC halting in idle could be causing
problems, but the watchdog should catch that as well. My only guess at
this point is that the ACPI PM is halting in idle along with the TSC.

And you said this only happens under load?

-john

Linus Torvalds

Feb 26, 2009, 4:50:27 PM

to Jesper Krogh, john stultz, Linux Kernel Mailing List

On Thu, 26 Feb 2009, Jesper Krogh wrote:
>
> > Also mind sending the full dmesg for both kernels?
>
> http://krogh.cc/~jesper/dmesg-2.6.29-rc6.txt
> http://krogh.cc/~jesper/dmesg-2.6.26.8.txt

Try changing

#define QUICK_PIT_MS 15

in arch/x86/kernel/tsc.c into something bigger. Let's say just doubling
it to 30. Does that change anything?

Linus

john stultz

Feb 26, 2009, 4:54:55 PM

to Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List, Thomas Gleixner, Len Brown

On Thu, 2009-02-26 at 22:35 +0100, Jesper Krogh wrote:

> > Also mind sending the full dmesg for both kernels?
>
> http://krogh.cc/~jesper/dmesg-2.6.29-rc6.txt
> http://krogh.cc/~jesper/dmesg-2.6.26.8.txt

So one interesting difference:
2.6.26.8: TSC calibrated against PM_TIMER
2.6.29-rc6: Fast TSC calibration using PIT

Thomas, any thoughts as to why we might be calibrating off the PIT
instead of the PM_TIMER w/ 2.6.29?

Maybe this line provides a hint?
FADT: X_PM1a_EVT_BLK.bit_width (16) does not match PM1_EVT_LEN (4)

thanks
-john

Thomas Gleixner

Feb 26, 2009, 4:55:31 PM

to john stultz, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

But why would it do that on 29-rc6 and not on 2.6.28.8 ? I'm not aware
of changes which might cause that.

Thanks,

tglx

Jesper Krogh

Feb 26, 2009, 5:04:29 PM

to Thomas Gleixner, john stultz, Linus Torvalds, Linux Kernel Mailing List

My comparison is 2.6.26.8 not 2.6.28.8 .. so fairly old.

It is a small cluster, so I'm slipping some test-kernels in when the
cluster is idle.

--
Jesper

Thomas Gleixner

Feb 26, 2009, 5:07:18 PM

to john stultz, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List, Len Brown

On Thu, 26 Feb 2009, john stultz wrote:
> On Thu, 2009-02-26 at 22:35 +0100, Jesper Krogh wrote:
> > > Also mind sending the full dmesg for both kernels?
> >
> > http://krogh.cc/~jesper/dmesg-2.6.29-rc6.txt
> > http://krogh.cc/~jesper/dmesg-2.6.26.8.txt
>
> So one interesting difference:
> 2.6.26.8: TSC calibrated against PM_TIMER
> 2.6.29-rc6: Fast TSC calibration using PIT
>
> Thomas, any thoughts as to why we might be calibrating off the PIT
> instead of the PM_TIMER w/ 2.6.29?

Yup, because we introduced the Fast PIT calibration in 2.6.28.

Is the delta anything NTP might get upset about:

2.6.26: time.c: Detected 2311.847 MHz processor.
2.6.29: Detected 2310.029 MHz processor.

If yes, then we need to fix NTP not the calibration code :)

> Maybe does this line provide a hint?
> FADT: X_PM1a_EVT_BLK.bit_width (16) does not match PM1_EVT_LEN (4)

Red herring.

Thanks,

tglx

Linus Torvalds

Feb 26, 2009, 5:25:20 PM

to Thomas Gleixner, john stultz, Jesper Krogh, Linux Kernel Mailing List, Len Brown

On Thu, 26 Feb 2009, Thomas Gleixner wrote:
>
> Is the delta anything NTP might get upset about:
>
> 2.6.26: time.c: Detected 2311.847 MHz processor.
> 2.6.29: Detected 2310.029 MHz processor.
>
> If yes, then we need to fix NTP not the calibration code :)

Well, that _is_ about 500ppm difference, and we claim that we _should_
have reached 150ppm with the 15ms delay. We clearly don't seem to have
done that. I'm not quite sure why - we _should_ be finding the edge of the
PIT events to within roughly a microsecond (assuming that's about as long
as an "inb" takes), and that should give us a pretty good fast
calibration, but maybe I'm overlooking something.

Or - and this may be more likely - there are chipsets that aren't very
good at reading the PIT in a tight loop. That may explain why it's a
problem on Jesper's hardware, but we haven't gotten tons of reports of
this from others.

I see that it's a SunFire X2200, which I think uses an nVidia HT
southbridge. I assume it's an nForce4 thing. There shouldn't be anything
odd there, and the PIT read shouldn't be taking any longer than on
anything else, but who knows?

Linus

john stultz

Feb 26, 2009, 5:31:53 PM

to Thomas Gleixner, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List, Len Brown

On Thu, 2009-02-26 at 23:06 +0100, Thomas Gleixner wrote:
> On Thu, 26 Feb 2009, john stultz wrote:
> > On Thu, 2009-02-26 at 22:35 +0100, Jesper Krogh wrote:
> > > > Also mind sending the full dmesg for both kernels?
> > >
> > > http://krogh.cc/~jesper/dmesg-2.6.29-rc6.txt
> > > http://krogh.cc/~jesper/dmesg-2.6.26.8.txt
> >
> > So one interesting difference:
> > 2.6.26.8: TSC calibrated against PM_TIMER
> > 2.6.29-rc6: Fast TSC calibration using PIT
> >
> > Thomas, any thoughts as to why we might be calibrating off the PIT
> > instead of the PM_TIMER w/ 2.6.29?
>
> Yup, because we introduced the Fast PIT calibration in 2.6.28.

Ah. Ok.

> Is the delta anything NTP might get upset about:
>
> 2.6.26: time.c: Detected 2311.847 MHz processor.
> 2.6.29: Detected 2310.029 MHz processor.

I wouldn't think so.

Although, I'm recalling on some systems here right after we deploy them
we'll see something similar to the originally reported ntpd "time reset"
noise for a period of time while ntpd tries to find the right freq. For
some reason, I've noticed, having multiple servers in your ntp.conf
seems to increase NTP's difficulty at picking a time and converging.

So this may just be the slight calibration change confusing ntp, or it
may be the NTP_INTERVAL_LENGTH change from a while back, which would cause
the drift value to change and could have the same effect (although I
thought that landed in the 2.6.24 timeframe, but I may be forgetting).

I'll kick up some of my own testing between these two releases to see if
I can't find something similar.

Jesper: How long was the box up for when you noticed the ntpd noise?

Also what's the output of the following under the different kernels:
ntpdc -c peers
ntpdc -c kerninfo

thanks
-john

Linus Torvalds

Feb 26, 2009, 5:32:26 PM

to Thomas Gleixner, john stultz, Jesper Krogh, Linux Kernel Mailing List, Len Brown

On Thu, 26 Feb 2009, Linus Torvalds wrote:

>
>
> On Thu, 26 Feb 2009, Thomas Gleixner wrote:
> >
> > Is the delta anything NTP might get upset about:
> >
> > 2.6.26: time.c: Detected 2311.847 MHz processor.
> > 2.6.29: Detected 2310.029 MHz processor.
> >
> > If yes, then we need to fix NTP not the calibration code :)
>
> Well, that _is_ about 500ppm difference

Doing the math rather than just eyeballing it, I think it's closer to
800ppm than 500ppm. But maybe I did that wrong too.

Which is definitely pretty far out. The theory is that if we can catch the
edge of the PIT timer to 1us, and even if we get it maximally wrong at
beginning/end (ie the difference is off by 2us), a 2us error over 15ms
should be on the order of just a 133ppm error.

So 800ppm looks too big. We're clearly not getting to within 1us of the
PIT timer event edge. But it would be interesting to hear whether making
the 15ms be 30ms will get us to a better place, and make ntp happier.

And maybe my math is just wrong, and it's not the "within 1us" assumption
that was wrong.
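
The two figures in this message can be checked with a few lines of arithmetic. This is a sketch using only the numbers quoted in the thread; the function names are made up for illustration.

```python
def ppm(measured, reference):
    """Relative difference of measured vs. reference, in parts per million."""
    return (measured - reference) / reference * 1e6


# Calibrated frequencies quoted above, in MHz.
pm_timer_mhz = 2311.847  # 2.6.26, calibrated against PM_TIMER
fast_pit_mhz = 2310.029  # 2.6.29-rc6, fast PIT calibration

delta_ppm = ppm(fast_pit_mhz, pm_timer_mhz)


def worst_case_ppm(edge_error_us, window_ms):
    """Relative error if the measured window is off by edge_error_us."""
    return edge_error_us / (window_ms * 1000.0) * 1e6


print(f"calibration delta: {delta_ppm:.0f} ppm")
print(f"2us over 15ms: {worst_case_ppm(2, 15):.0f} ppm")
print(f"2us over 30ms: {worst_case_ppm(2, 30):.0f} ppm")
```

The difference between the two calibrations comes out near 786 ppm in magnitude, so "closer to 800ppm than 500ppm" holds, and a 2us edge error over 15ms is about 133 ppm, halving if the window is doubled.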

Linus Torvalds

Feb 26, 2009, 5:41:33 PM

to john stultz, Thomas Gleixner, Jesper Krogh, Linux Kernel Mailing List, Len Brown

On Thu, 26 Feb 2009, john stultz wrote:
>
> I'll kick up some of my own testing between these two releases to see if
> I can't find something similar.

Since the PIT timer read is possibly hw-dependent, it might be that you
can't necessarily reproduce it on some random hardware.

How sensitive is ntpd to (stable) drift? IOW, if we get the calibration
wrong, the TSC should still hopefully be very _stable_, it's just that the
initial guesstimate for the frequency is off and ntp would have to correct
for that.

The easiest way to test might be to just force a 1000ppm estimation error
with something like this total hack (indented just so that nobody would
ever apply this by mistake):

	 arch/x86/kernel/tsc.c |    4 ++++
	 1 files changed, 4 insertions(+), 0 deletions(-)

	diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
	index 599e581..b80a0c4 100644
	--- a/arch/x86/kernel/tsc.c
	+++ b/arch/x86/kernel/tsc.c
	@@ -350,6 +350,10 @@ static unsigned long quick_pit_calibrate(void)
	 		delta = (t2 - t1)*PIT_TICK_RATE;
	 		do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
	 		printk("Fast TSC calibration using PIT\n");
	+
	+		/* HACK! */
	+		delta -= delta >> 10;
	+
	 		return delta;
	 	}
	 failed:

which wouldn't be hardware-dependent.

Linus
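
To see how large an error the hack injects, the following sketch mirrors the arithmetic in the patch context. The PIT_TICK_RATE value and the QUICK_PIT_ITERATIONS formula are assumptions taken from memory of the kernel source of that era, not from this thread; the starting frequency is the one reported in the 2.6.29-rc6 dmesg.

```python
PIT_TICK_RATE = 1193182  # Hz; assumed value of the kernel constant
QUICK_PIT_MS = 15
# Assumed to mirror the kernel macro, with integer division throughout.
QUICK_PIT_ITERATIONS = QUICK_PIT_MS * PIT_TICK_RATE // 1000 // 256


def quick_pit_khz(tsc_delta):
    """TSC frequency in kHz from the cycles counted over the PIT window."""
    return tsc_delta * PIT_TICK_RATE // (QUICK_PIT_ITERATIONS * 256 * 1000)


def apply_hack(khz):
    """The 'HACK!' line from the patch: delta -= delta >> 10."""
    return khz - (khz >> 10)


khz = 2310029  # "Detected 2310.029 MHz processor."
hacked = apply_hack(khz)
error_ppm = (khz - hacked) / khz * 1e6
print(QUICK_PIT_ITERATIONS, hacked, round(error_ppm))
```

Shifting right by 10 subtracts 1/1024 of the value, so the forced error is a little under 1000 ppm regardless of the hardware.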

john stultz

Feb 26, 2009, 6:00:05 PM

to Linus Torvalds, Thomas Gleixner, Jesper Krogh, Linux Kernel Mailing List, Len Brown

On Thu, 2009-02-26 at 14:40 -0800, Linus Torvalds wrote:
>
> On Thu, 26 Feb 2009, john stultz wrote:
> >
> > I'll kick up some of my own testing between these two releases to see if
> > I can't find something similar.
>
> Since the PIT timer read is possibly hw-dependent, it might be that you
> can't necessarily reproduce it on some random hardware.
>
> How sensitive is ntpd to (stable) drift? IOW, if we get the calibration
> wrong, the TSC should still hopefully be very _stable_, it's just that the
> initial guesstimate for the frequency is off and ntp would have to correct
> for that.

NTP can adjust the clock about +/-500ppm (so a 1000ppm range). Past that
it starts throwing errors.

Part of the issue is that if the drift value changes in between boots,
NTPd can take a while to settle down on the right freq. I suspect that's
what's happening here, and should the box be left alone for a few hours
(maybe overnight) NTPd will find the new drift correction and the issue
will go away.

Thomas tripped over this a little while back when the
NTP_INTERVAL_LENGTH change landed, but I think that was prior to 2.6.26,
so it's probably the calibration changes discussed, but I wanted to see
if there were any other slight changes that might be contributing to the
issue as well.

thanks
-john

Jesper Krogh

Feb 27, 2009, 1:31:32 AM

to john stultz, Linus Torvalds, Linux Kernel Mailing List, Thomas Gleixner

john stultz wrote:
> On Thu, 2009-02-26 at 22:35 +0100, Jesper Krogh wrote:
>> john stultz wrote:
>>> On Thu, Feb 26, 2009 at 12:43 PM, Jesper Krogh <jes...@krogh.cc> wrote:
>>>> Linus Torvalds wrote:
>>>>> On Thu, 26 Feb 2009, Jesper Krogh wrote:
>>>>>> 2.6.26.8 doesnt have this problem.
>>>>>>
>>>>>> The "current_clocsource" is the same on both systems.
>>>>>>
>>>>>> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>>>>>> tsc
>>>>> What does the frequency calibrate to? It should be in the dmesg. Does it
>>>>> differ by a big amount?
>>>> Non-working:
>>>> $ dmesg | grep -i freq
>>>> [ 0.004007] Calibrating delay loop (skipped), value calculated using
>>>> timer frequency.. 4620.05 BogoMIPS (lpj=9240104)
>>>>
>>>> 2.6.26.8 doesn't have that information.
>>> I'm surprised the clocksource watchdog isn't catching it.
>>>
>>> What's the output from:
>>> cat /sys/devices/system/clocksource/clocksource0/available_clocksource
>> $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
>> tsc acpi_pm jiffies
>
> Hmm. Does booting w/ "clocksource=acpi_pm" also show the severe (~550ppm,
> which NTP can't handle) drift?

I booted another server (identical hardware) with the same kernel and
the above clocksource line, it has run over night (8 hours) with full
load and ntp has not complained about anything on that server.

> From the dmesg, I don't see any major calibration difference right off.
>
> So I'd suspect something like TSC halting in idle could be causing
> problems, but the watchdog should catch that as well. My only guess at
> this point is that the ACPI PM is halting in idle along with the TSC.
>
> And you said this only happens under load?

I can't say that, but I've only observed it under load.

--
Jesper

Jesper Krogh

Feb 27, 2009, 1:47:58 AM

to john stultz, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

It was booted Feb 25 21:58; the first noise from ntp starts here:
Feb 25 22:09:53 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
Feb 25 22:09:56 quad12 ntpd[4901]: synchronized to 10.194.133.13, stratum 4
Feb 25 22:14:08 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
Feb 25 22:16:20 quad12 ntpd[4901]: synchronized to 10.194.133.13, stratum 4
Feb 25 22:32:25 quad12 ntpd[4901]: time reset -1.601641 s
Feb 25 22:36:18 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
Feb 25 22:36:45 quad12 ntpd[4901]: synchronized to 10.194.133.12, stratum 4
Feb 25 22:51:41 quad12 ntpd[4901]: time reset -0.922993 s
Feb 25 22:55:05 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13

> Also what's the output of the following under the different kernels:
> ntpdc -c peers
> ntpdc -c kerninfo

Working (clocksource=acpi_pm) 2.6.29-rc6
jk@quad02:~$ ntpdc -c kerninfo
pll offset: -0.001577 s
pll frequency: -45.787 ppm
maximum error: 0.066739 s
estimated error: 0.000768 s
status: 0001 pll
pll time constant: 6
precision: 1e-06 s
frequency tolerance: 500 ppm
jk@quad02:~$ ntpdc -c peers
     remote           local      st poll reach  delay   offset    disp
=======================================================================
*hal.nzcorp.net  10.194.132.81    4   64  377 0.00008  0.003752 0.04816
=svn.nzcorp.net  10.194.132.81    4   64  377 0.00009 -0.008724 0.04979
=LOCAL(0)        127.0.0.1       13   64  377 0.00000  0.000000 0.03082

Working (clocksource=tsc) 2.6.26.8
jk@quad03:~$ ntpdc -c kerninfo
pll offset: 0.003208 s
pll frequency: -25.070 ppm
maximum error: 0.833193 s
estimated error: 0.002787 s
status: 4001 pll
pll time constant: 10
precision: 1e-06 s
frequency tolerance: 500 ppm
jk@quad03:~$ ntpdc -c peers
     remote           local      st poll reach  delay   offset    disp
=======================================================================
*hal.nzcorp.net  10.194.132.82    4 1024  377 0.00781  0.006788 0.13666
=sal.nzcorp.net  10.194.132.82    4 1024  377 0.00018 -0.000541 0.12175
=LOCAL(0)        127.0.0.1       13   64  377 0.00000  0.000000 0.03041

Non-working (clocksource=tsc) 2.6.29-rc6
jk@quad12:~$ ntpdc -c kerninfo
pll offset: 0 s
pll frequency: -34.754 ppm
maximum error: 0.023514 s
estimated error: 0 s
status: 0001 pll
pll time constant: 6
precision: 1e-06 s
frequency tolerance: 500 ppm
jk@quad12:~$ ntpdc -c peers
     remote            local      st poll reach  delay   offset    disp
=======================================================================
=hal.nzcorp.net   10.194.132.91    4   64   17 0.00011 -0.069377 0.96895
=trac.nzcorp.net  10.194.132.91    4   64   17 0.00011 -0.096107 0.96904
*LOCAL(0)         127.0.0.1       13   64   17 0.00000  0.000000 0.96857
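
When comparing several hosts it can help to pull the numbers out of the kerninfo output mechanically. A minimal sketch, assuming the "key: value" layout shown above stays the same; the sample text is the non-working host's output and the helper names are hypothetical.

```python
SAMPLE = """\
pll offset: 0 s
pll frequency: -34.754 ppm
maximum error: 0.023514 s
estimated error: 0 s
status: 0001 pll
pll time constant: 6
precision: 1e-06 s
frequency tolerance: 500 ppm
"""


def parse_kerninfo(text):
    """Split 'key: value' lines of ntpdc kerninfo output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def leading_number(value):
    """Return the first whitespace-separated token as a float."""
    return float(value.split()[0])


info = parse_kerninfo(SAMPLE)
freq = leading_number(info["pll frequency"])
tolerance = leading_number(info["frequency tolerance"])
print(f"pll frequency {freq} ppm, headroom {tolerance - abs(freq):.3f} ppm")
```

On this sample the applied frequency correction is still far from the 500 ppm tolerance reported by the kernel.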

Ingo Molnar

Feb 27, 2009, 2:33:53 AM

to john stultz, Linus Torvalds, Thomas Gleixner, Jesper Krogh, Linux Kernel Mailing List, Len Brown

* john stultz <john...@us.ibm.com> wrote:

> On Thu, 2009-02-26 at 14:40 -0800, Linus Torvalds wrote:
> >
> > On Thu, 26 Feb 2009, john stultz wrote:
> > >
> > > I'll kick up some of my own testing between these two releases to see if
> > > I can't find something similar.
> >
> > Since the PIT timer read is possibly hw-dependent, it might be that you
> > can't necessarily reproduce it on some random hardware.
> >
> > How sensitive is ntpd to (stable) drift? IOW, if we get the calibration
> > wrong, the TSC should still hopefully be very _stable_, it's just that the
> > initial guesstimate for the frequency is off and ntp would have to correct
> > for that.
>
> NTP can adjust the clock about +/-500ppm (so a 1000ppm range).
> Past that it starts throwing errors.

Well, it will start throwing errors but still it will correct
the clock and find the frequency delta between the host clock
and the reference clock just fine, and converge in a couple of
hours, correct?

500ppm is 0.05% of a frequency drift which is awfully small -
thermal effects alone can cause such differences so it should
not be anything out of the ordinary for ntpd.

> Part of the issue is that if the drift value changes in
> between boots, NTPd can take a while to settle down on the
> right freq. I suspect that's whats happening here, and should
> the box be left alone for a few hours (maybe overnight) NTPd
> will find the new drift correction the issue will go away.

If the default poll interval of 64 seconds is used then it can
take that much time - so I'd suggest decreasing that to below 10
seconds.

It's not like the frequency is changing rapidly here. The
correction pattern to find is a very simple, static and reliable
multiplier of ~1.000800 between the two frequencies.

Say the over-the-network reference clock ntpd follows has a 10
msecs of intrinsic observation noise. For that 10 msecs noise to
go down to the 10 ppm range [to the local but drifted time
source which has ~10 ppm precision straight away], we need
roughly 1000 samples. [simplified, fewer are enough in reality,
especially if you have some known-to-have-converged-before
cached value to start out with.]

1000 samples with 64 seconds intervals can take half a day to
converge. 1000 samples with 1 second intervals takes just 15
minutes to converge.

We'll improve in-kernel calibration but calibration noise in the
0.05% range should be expected in some cases.

Ingo
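
The convergence estimate above is simple multiplication; the sketch below spells it out using the 1000-sample figure from the message. The function name is made up, and the seconds-per-day line is an added illustration of what 500 ppm means in wall-clock terms.

```python
def convergence_time_s(samples, poll_interval_s):
    """Time needed to collect the given number of samples."""
    return samples * poll_interval_s


SAMPLES = 1000  # the rough sample count used in the message above
for poll in (64, 1):
    seconds = convergence_time_s(SAMPLES, poll)
    print(f"poll {poll:>2} s: {seconds / 3600:.1f} h ({seconds / 60:.0f} min)")

drift_fraction = 500e-6  # 500 ppm
seconds_per_day = 86400 * drift_fraction
print(f"500 ppm = {drift_fraction * 100:.2f}% = {seconds_per_day:.1f} s/day")
```

With a 64 second poll interval that is a bit under 18 hours, and with a 1 second interval a bit under 17 minutes, consistent with the rounder "half a day" and "15 minutes" figures.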

john stultz

Feb 27, 2009, 3:38:26 PM

to Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

On Fri, 2009-02-27 at 07:47 +0100, Jesper Krogh wrote:
> john stultz wrote:

> > Jesper: How long was the box up for when you noticed the ntpd noise?
>
> I was booted Feb 25 21:58 .. the first noice from ntp starts here:
> Feb 25 22:09:53 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
> Feb 25 22:09:56 quad12 ntpd[4901]: synchronized to 10.194.133.13, stratum 4
> Feb 25 22:14:08 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
> Feb 25 22:16:20 quad12 ntpd[4901]: synchronized to 10.194.133.13, stratum 4
> Feb 25 22:32:25 quad12 ntpd[4901]: time reset -1.601641 s
> Feb 25 22:36:18 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13
> Feb 25 22:36:45 quad12 ntpd[4901]: synchronized to 10.194.133.12, stratum 4
> Feb 25 22:51:41 quad12 ntpd[4901]: time reset -0.922993 s
> Feb 25 22:55:05 quad12 ntpd[4901]: synchronized to LOCAL(0), stratum 13

Ok, so that's not very long. I'd expect by now, if the box is still up,
the messages have stopped. Is that true, or is it still resetting?

> > Also what's the output of the following under the different kernels:
> > ntpdc -c peers
> > ntpdc -c kerninfo

[snip]

> Working (clocksource=tsc) 2.6.26.8
> jk@quad03:~$ ntpdc -c kerninfo
> pll offset: 0.003208 s
> pll frequency: -25.070 ppm

[snip]

> Non-working (clocksource=tsc) 2.6.29-rc6
> jk@quad12:~$ ntpdc -c kerninfo
> pll offset: 0 s
> pll frequency: -34.754 ppm

Ok, so it seems ntp hasn't really had a chance to settle down, it's only
made a 10ppm adjustment so far. NTPd will stop corrections at ~
+/-500ppm, so you're not at that bound yet, where things would be really
broken.

If the affected kernel isn't resetting in the logs anymore, I'd be
interested in what the new ppm value is.

thanks
-john

john stultz

Feb 27, 2009, 3:50:35 PM

to Ingo Molnar, Linus Torvalds, Thomas Gleixner, Jesper Krogh, Linux Kernel Mailing List, Len Brown

On Fri, 2009-02-27 at 08:33 +0100, Ingo Molnar wrote:
> * john stultz <john...@us.ibm.com> wrote:
>
> > On Thu, 2009-02-26 at 14:40 -0800, Linus Torvalds wrote:
> > >
> > > On Thu, 26 Feb 2009, john stultz wrote:
> > > >
> > > > I'll kick up some of my own testing between these two releases to see if
> > > > I can't find something similar.
> > >
> > > Since the PIT timer read is possibly hw-dependent, it might be that you
> > > can't necessarily reproduce it on some random hardware.
> > >
> > > How sensitive is ntpd to (stable) drift? IOW, if we get the calibration
> > > wrong, the TSC should still hopefully be very _stable_, it's just that the
> > > initial guesstimate for the frequency is off and ntp would have to correct
> > > for that.
> >
> > NTP can adjust the clock about +/-500ppm (so a 1000ppm range).
> > Past that it starts throwing errors.
>
> Well, it will start throwing errors but still it will correct
> the clock and find the frequency delta between the host clock
> and the reference clock just fine, and converge in a couple of
> hours, correct?

No. The NTP spec limits the freq correction to ~+/-500ppm. Once NTPd hits
that 500ppm wall, it will throw an error and stop trying to sync the
clock.

> 500ppm is 0.05% of a frequency drift which is awfully small -
> thermal effects alone can cause such differences so it should
> not be anything out of the ordinary for ntpd.

Practically I've not seen boxes that vary that much. I've seen very poor
systems whose crystals are off by ~280ppm, but those don't vary that
much over time.

> > Part of the issue is that if the drift value changes in
> > between boots, NTPd can take a while to settle down on the
> > right freq. I suspect that's whats happening here, and should
> > the box be left alone for a few hours (maybe overnight) NTPd
> > will find the new drift correction the issue will go away.
>
> If the default poll interval of 64 seconds is used then it can
> take that much time - so i'd sugges to decrease that to below 10
> seconds.

Indeed. Shortening the maxpoll value in the ntp.conf greatly improves
how fast and how close the client will sync to the server, but take
caution, as that can cause undue load on public time servers.

thanks
-john

Jesper Krogh

Mar 1, 2009, 8:52:13 AM

to john stultz, Linus Torvalds, Linux Kernel Mailing List, Thomas Gleixner

That wasn't true.. I got some real sunday testing done today. A fresh
2.6.28.7 has the same problem with a load of 0.00 0.00 0.00

2.6.27.19 doesn't have problems keeping time.

--
Jesper

Jesper Krogh

Mar 1, 2009, 10:05:31 AM

to Linus Torvalds, john stultz, Linux Kernel Mailing List

Linus Torvalds wrote:
>
> On Thu, 26 Feb 2009, Jesper Krogh wrote:
>>> Also mind sending the full dmesg for both kernels?
>> http://krogh.cc/~jesper/dmesg-2.6.29-rc6.txt
>> http://krogh.cc/~jesper/dmesg-2.6.26.8.txt
>
> Try changing
>
> #define QUICK_PIT_MS 15
>
> in arch/x86/kernel/tsc.c into something bigger. Let's say just doubling
> it to 30. Does that change anything?

It seems to "slow down" the process (time from bootup to first clock
reset).

Mar 1 15:38:41 quad01 ntpd[4603]: synchronized to LOCAL(0), stratum 13
Mar 1 15:38:41 quad01 ntpd[4603]: kernel time sync status change 0001
Mar 1 15:39:47 quad01 ntpd[4603]: synchronized to 10.194.133.13, stratum 4
Mar 1 15:43:02 quad01 ntpd[4603]: synchronized to 10.194.133.12, stratum 4
Mar 1 15:53:41 quad01 ntpd[4603]: time reset -0.352221 s
Mar 1 15:57:18 quad01 ntpd[4603]: synchronized to LOCAL(0), stratum 13
Mar 1 15:58:23 quad01 ntpd[4603]: synchronized to 10.194.133.13, stratum 4
jk@quad01:~$ w
16:03:29 up 28 min, 2 users, load average: 0.04, 0.01, 0.00

--
Jesper

Jesper Krogh

Mar 1, 2009, 10:09:24 AM

to Linus Torvalds, Linux Kernel Mailing List

Jesper Krogh wrote:
> The "current_clocsource" is the same on both systems.
>
> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> tsc

What selects the "current_clocksource"? I tried to boot one of the
kernels that have the problem on another piece of hardware and on that
system it ended up defaulting to "acpi_pm" instead of "tsc".

http://krogh.cc/~jesper/dmesg-2.6.28.7.txt

"acpi_pm" seems to be reliable all the time.

Sitsofe Wheeler

Mar 1, 2009, 10:44:55 AM3/1/09

to Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

On Sun, Mar 01, 2009 at 04:09:03PM +0100, Jesper Krogh wrote:
> Jesper Krogh wrote:
> >The "current_clocsource" is the same on both systems.
> >
> >$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> >tsc
>
> What selects the "current_clocksource"? I tried to boot one of the
> kernels hat have the problem on another piece of hardware and on that
> system it ended up defaulting to "acpi_pm" instead of "tsc".

I believe different clock sources have different priorities based on
their resolution and behaviour. Clock sources that "go bad" because of
hardware interactions are hopefully detected, and the next-best clock
sources are then tried.
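
As a rough illustration of the selection idea described above (this is not the kernel's code), the sketch below picks the highest-rated clock source that has not been marked unstable. The `ClockSource` type, the `select_clocksource` helper and the rating numbers are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ClockSource:
    name: str
    rating: int             # higher means preferred; values here are illustrative
    unstable: bool = False  # set when the source has been detected as "gone bad"


def select_clocksource(sources):
    """Return the name of the best usable clock source, or None if there is none."""
    usable = [s for s in sources if not s.unstable]
    if not usable:
        return None
    return max(usable, key=lambda s: s.rating).name


sources = [
    ClockSource("tsc", 300),
    ClockSource("hpet", 250),
    ClockSource("acpi_pm", 200),
]
print(select_clocksource(sources))  # tsc
sources[0].unstable = True          # pretend the TSC was flagged as unreliable
print(select_clocksource(sources))  # hpet
```

On a running system the actual choice can be read from /sys/devices/system/clocksource/clocksource0/current_clocksource, as shown in the quoted text above.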

There was a nice treatment of different clock sources in this
kernelnewbies thread:
http://www.mail-archive.com/kernel...@nl.linux.org/msg05164.html .

--
Sitsofe | http://sucs.org/~sits/

Jesper Krogh

Mar 1, 2009, 3:13:57 PM3/1/09

to john stultz, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

john stultz wrote:
>> Working (clocksource=tsc) 2.6.26.8
>> jk@quad03:~$ ntpdc -c kerninfo
>> pll offset: 0.003208 s
>> pll frequency: -25.070 ppm
> [snip]
>> Non-working (clocksource=tsc) 2.6.29-rc6
>> jk@quad12:~$ ntpdc -c kerninfo
>> pll offset: 0 s
>> pll frequency: -34.754 ppm
>
>
> Ok, so it seems ntp hasn't really had a chance to settle down, its only
> made a 10ppm adjustment so far. NTPd will stop corrections at ~
> +/-500ppm, so you're not at that bound yet, where things would be really
> broken.

But it should settle within a "reasonable" period of time (not hours)?

> If the affected kernel isn't resetting in the logs anymore, I'd be
> interested in what the new ppm value is.

It keeps resetting after 7 hours. Is there more information I can
provide?

Jesper

--
Jesper

Jesper Krogh

Mar 2, 2009, 4:54:10 AM3/2/09

to john stultz, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

john stultz wrote:
> Ok, so it seems ntp hasn't really had a chance to settle down, its only
> made a 10ppm adjustment so far. NTPd will stop corrections at ~
> +/-500ppm, so you're not at that bound yet, where things would be really
> broken.
>
> If the affected kernel isn't resetting in the logs anymore, I'd be
> interested in what the new ppm value is.

After 20 hours, it's still resetting.
Mar 2 10:43:24 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
Mar 2 10:50:37 quad12 ntpd[4416]: time reset -1.103654 s
jk@quad12:~$ uptime
10:51:36 up 20:46, 1 user, load average: 0.00, 0.00, 0.00

And it hasn't shifted clocksource either.

jk@quad12:~$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

--
Jesper

john stultz

Mar 2, 2009, 4:28:24 PM3/2/09

to Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

On Mon, 2009-03-02 at 10:53 +0100, Jesper Krogh wrote:
> john stultz wrote:
> > Ok, so it seems ntp hasn't really had a chance to settle down, its only
> > made a 10ppm adjustment so far. NTPd will stop corrections at ~
> > +/-500ppm, so you're not at that bound yet, where things would be really
> > broken.
> >
> > If the affected kernel isn't resetting in the logs anymore, I'd be
> > interested in what the new ppm value is.
>
> After 20 hours.. its still resetting.
> Mar 2 10:43:24 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
> Mar 2 10:50:37 quad12 ntpd[4416]: time reset -1.103654 s

So what's the "ntpdc -c kerninfo" output now?

thanks
-john

Jesper Krogh

Mar 3, 2009, 1:04:28 AM3/3/09

to john stultz, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

john stultz wrote:
> On Mon, 2009-03-02 at 10:53 +0100, Jesper Krogh wrote:
>> john stultz wrote:
>>> Ok, so it seems ntp hasn't really had a chance to settle down, its only
>>> made a 10ppm adjustment so far. NTPd will stop corrections at ~
>>> +/-500ppm, so you're not at that bound yet, where things would be really
>>> broken.
>>>
>>> If the affected kernel isn't resetting in the logs anymore, I'd be
>>> interested in what the new ppm value is.
>> After 20 hours.. its still resetting.
>> Mar 2 10:43:24 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
>> Mar 2 10:50:37 quad12 ntpd[4416]: time reset -1.103654 s
>
> So what's the "ntpdc -c kerninfo" output now?

Mar 3 06:41:10 quad12 ntpd[4416]: time reset -0.813957 s
Mar 3 06:45:20 quad12 ntpd[4416]: synchronized to LOCAL(0), stratum 13
Mar 3 06:45:36 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
Mar 3 06:51:57 quad12 ntpd[4416]: synchronized to 10.194.133.13, stratum 4
Mar 3 07:00:29 quad12 ntpd[4416]: time reset -0.783390 s

jk@quad12:~$ ntpdc -c kerninfo
pll offset: 0 s
pll frequency: -28.691 ppm
maximum error: 1.0433 s
estimated error: 0 s
status: 0001 pll
pll time constant: 4
precision: 1e-06 s
frequency tolerance: 500 ppm

jk@quad12:~$ w
07:03:17 up 1 day, 16:59, 1 user, load average: 0.00, 0.00, 0.00

john stultz

Mar 3, 2009, 3:02:01 PM3/3/09

to Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

On Tue, 2009-03-03 at 07:04 +0100, Jesper Krogh wrote:
> john stultz wrote:
> > On Mon, 2009-03-02 at 10:53 +0100, Jesper Krogh wrote:
> >> john stultz wrote:
> >>> Ok, so it seems ntp hasn't really had a chance to settle down, its only
> >>> made a 10ppm adjustment so far. NTPd will stop corrections at ~
> >>> +/-500ppm, so you're not at that bound yet, where things would be really
> >>> broken.
> >>>
> >>> If the affected kernel isn't resetting in the logs anymore, I'd be
> >>> interested in what the new ppm value is.
> >> After 20 hours.. its still resetting.
> >> Mar 2 10:43:24 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
> >> Mar 2 10:50:37 quad12 ntpd[4416]: time reset -1.103654 s
> >
> > So what's the "ntpdc -c kerninfo" output now?
>
> Mar 3 06:41:10 quad12 ntpd[4416]: time reset -0.813957 s
> Mar 3 06:45:20 quad12 ntpd[4416]: synchronized to LOCAL(0), stratum 13
> Mar 3 06:45:36 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
> Mar 3 06:51:57 quad12 ntpd[4416]: synchronized to 10.194.133.13, stratum 4
> Mar 3 07:00:29 quad12 ntpd[4416]: time reset -0.783390 s
> jk@quad12:~$ ntpdc -c kerninfo
> pll offset: 0 s
> pll frequency: -28.691 ppm

This is baffling. You've only gone from -34.754ppm to -28.691ppm in over
a day? And you're still not syncing? If the calibration was so bad that
NTP couldn't sync, I'd expect the freq value to hit +/-500ppm before it
gave up. This just doesn't follow my expectations.

Could you provide:
/usr/sbin/ntpdc -c version

Do you see the same behavior if you drop all but one server (including
the local clock: 127.127.1.0)?

You might also add "minpoll 4 maxpoll 4" to the server line to speed up
testing.

Actually, if you could, I'd be interested if you could send your
ntp.conf

thanks
-john

Jesper Krogh

Mar 3, 2009, 3:20:03 PM3/3/09

to john stultz, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

It's resetting. Without deep knowledge of NTP, doesn't that mean
"start over again"? I believe it hits +/-500ppm.

> Could you provide:
> /usr/sbin/ntpdc -c version

$ ntpdc -c version
ntpdc 4.2...@1.1520-o Tue Jan 6 15:51:00 UTC 2009 (1)

> Do you see the same behavior if you drop all but one server (including
> the local clock: 127.127.1.0)?
>
> You might also add "minpoll 4 maxpoll 4" to the server line to speed up
> testing.

Will try those options while debugging.

> Actually, if you could, I'd be interested if you could send your
> ntp.conf

http://krogh.cc/~jesper/ntp.conf

But this seems to be a "regression", since 2.6.27.19 doesn't misbehave.
Same NTP, same configuration, same hardware. The only change is the kernel
version. Or am I missing some parameter here?

Would it make sense to try to bisect it?

Jesper

--
Jesper

Jesper Krogh

Mar 3, 2009, 3:40:08 PM3/3/09

to john stultz, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

john stultz wrote:
> Do you see the same behavior if you drop all but one server (including
> the local clock: 127.127.1.0)?

Yes.
Mar 3 21:20:59 quad12 ntpd[2435]: ntpd 4.2...@1.1520-o Tue Jan 6 15:50:55 UTC 2009 (1)
Mar 3 21:20:59 quad12 ntpd[2436]: precision = 1.000 usec
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #1 wildcard, ::#123 Disabled
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #2 lo, ::1#123 Enabled
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #3 bond0, fe80::21e:68ff:fe57:8169#123 Enabled
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #5 bond0, 10.194.132.91#123 Enabled
Mar 3 21:20:59 quad12 ntpd[2436]: kernel time sync status 0040
Mar 3 21:20:59 quad12 ntpd[2436]: frequency initialized -29.286 PPM from /var/lib/ntp/ntp.drift
Mar 3 21:21:58 quad12 ntpd[2436]: synchronized to 10.194.133.12, stratum 4
Mar 3 21:21:58 quad12 ntpd[2436]: time reset -6.148275 s
Mar 3 21:21:58 quad12 ntpd[2436]: kernel time sync status change 0001
Mar 3 21:25:01 quad12 ntpd[2436]: synchronized to 10.194.133.12, stratum 4
Mar 3 21:37:03 quad12 ntpd[2436]: time reset -0.664351 s

Only one server and the minpoll 4 maxpoll 4 options to the server line.

--
Jesper

john stultz

Mar 3, 2009, 5:24:45 PM3/3/09

to Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

Well, it may still need a few hours to settle. :) Again, those time
resets are seen when NTPd doesn't have a good drift ppm at startup, and
it has to find it.

thanks
-john

john stultz

Mar 3, 2009, 5:25:04 PM3/3/09

to Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

No, the "time reset" message means that when the offset is larger
than 0.125 sec (the slew boundary), NTPd has corrected it by calling
settimeofday instead of slewing the clock.

Here's some background about how NTP and the kernel interact:
Every time NTPd calls adjtimex(), it provides the current offset from
the tracked ntp server. The kernel takes this offset and applies a
temporary correction factor to the clocksource frequency to converge
that offset. It also takes the provided offset, dampens it, and then
uses the result to adjust the frequency value. Once the freq value hits
the max adjustment value (+/- 500ppm), then NTP will start throwing
error messages and give up.
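
To make the two limits above concrete, here is a minimal sketch (not NTPd's or the kernel's actual logic) that classifies an offset as a step or a slew and clamps a frequency correction. The constants are the 0.125 s slew boundary and +/-500 ppm limit quoted in this message; the function names are hypothetical.

```python
SLEW_BOUNDARY_S = 0.125  # slew boundary quoted in the message above
MAX_FREQ_PPM = 500.0     # maximum frequency adjustment quoted in the message above


def correction_mode(offset_s):
    """'step' means a time reset (settimeofday); 'slew' means gradual correction."""
    return "step" if abs(offset_s) > SLEW_BOUNDARY_S else "slew"


def clamp_freq(freq_ppm):
    """Limit a requested frequency correction to the +/-500 ppm range."""
    return max(-MAX_FREQ_PPM, min(MAX_FREQ_PPM, freq_ppm))


print(correction_mode(-0.664351))  # step: offset from a "time reset" log line
print(correction_mode(0.003208))   # slew: pll offset reported on the working kernel
print(clamp_freq(-600.0))          # -500.0: a hypothetical request beyond the limit
```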

The part that is so odd with your data, is that the freq value isn't
changing very much. After a time reset, I'd expect to see adjustments in
the 100us, then multiple ms, and only once we get above 100ms to see
another time reset. All the while, these adjustment values should be
tweaking the freq value, causing the clocks to converge.

The case I can think of that could cause this, is if the drift is
somehow jumping above the slew boundary before NTPd actually makes any
adjtimex calls, so we end up with minimal correction to the freq value,
but that still doesn't completely vibe with the data.

> > Could you provide:
> > /usr/sbin/ntpdc -c version
>
> $ ntpdc -c version
> ntpdc 4.2...@1.1520-o Tue Jan 6 15:51:00 UTC 2009 (1)
>
> > Do you see the same behavior if you drop all but one server (including
> > the local clock: 127.127.1.0)?
> >
> > You might also add "minpoll 4 maxpoll 4" to the server line to speed up
> > testing.
>
> Will try those option while debugging.
>
> > Actually, if you could, I'd be interested if you could send your
> > ntp.conf
>
> http://krogh.cc/~jesper/ntp.conf

Cool, I see you're collecting stats already. Depending on the results of
the tests above I may want to check those out as well.

> But this seems to be a "regression". Since 2.6.27.19 doesn't misbehave.
> Same NTP, same configuration, same hardware. only change is the kernel
> version. Or am I missing some parameter here?
>
> Would it make sense to try to bisect it?

Well, I suspect you'll just bisect it to the fast-pit TSC calibration
causing a different correction freq to be needed for synchronization.
The odd part is that the userland NTPd isn't behaving as I'd expect if
the TSC calibration was really so bad that NTP couldn't handle it.

Bisection may be something worth trying just to verify or disprove that
theory, so if you have the time, it would be interesting to see. But if
the theory is true then we're back to the same spot.

I guess something to test my idea above (that the drift is bad enough
that NTPd isn't making slew adjustments via adjtimex offset) is to
remove NTPd from the init.d startup.

Then after rebooting (into 2.6.29), run the attached python script for
10 minutes or so to get an idea of the ppm drift. Then repeat with
2.6.26.

To run:
./drift-test.py <ntp server>

It will give some wild ppm numbers, but after a few minutes it should
settle down to the "natural drift" of the system.

thanks
-john

drift-test.py

Jesper Krogh

Mar 4, 2009, 12:36:38 AM3/4/09

to john stultz, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

With one server and the maxpoll minpoll stuff, this one "settled" after a
bit more than 3 hours:
Mar 4 01:14:05 quad12 ntpd[2436]: time reset -0.381826 s
Mar 4 01:15:39 quad12 ntpd[2436]: synchronized to 10.194.133.12, stratum 4
jk@quad12:~$ uptime
06:35:40 up 15:55, 1 user, load average: 0.00, 0.00, 0.00
jk@quad12:~$ ntpq -c peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*bioinf.nzcorp.n 10.192.96.19     4 u    8   16  377    0.098  -80.184   0.673
jk@quad12:~$ ntpdc -c kerinfo
***Command `kerinfo' unknown

jk@quad12:~$ ntpdc -c kerninfo
pll offset: -0.06619 s
pll frequency: -500.000 ppm
maximum error: 0.130081 s
estimated error: 0.001201 s
status: 0001 pll
pll time constant: 4
precision: 1e-06 s
frequency tolerance: 500 ppm

--
Jesper

Jesper Krogh

Mar 4, 2009, 10:31:31 AM3/4/09

to john stultz, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

john stultz wrote:
> I guess something to test my idea above (that the drift is bad enough
> that NTPd isn't making slew adjustments via adjtimex offset) is to
> remove NTPd from the init.d startup.
>
> Then after rebooting (into 2.6.29), run the attached python script for
> 10 minutes or so to get an idea of the ppm drift. Then repeat with
> 2.6.26.
>
> To run:

> ./drift-test.py <ntp server>

>
> It will give some wild ppm numbers, but after a few minutes it should
> settle down to the "natural drift" of the system.

Ok. I removed ntpd from the system. Here is the output from the
"non-working" 2.6.28.7 kernel.
04 Mar 14:59:16 offset: -0.139829 drift: -656.0 ppm
04 Mar 15:00:16 offset: -0.175233 drift: -591.147540984 ppm
04 Mar 15:01:16 offset: -0.210637 drift: -590.611570248 ppm
04 Mar 15:02:16 offset: -0.246033 drift: -590.386740331 ppm
04 Mar 15:03:17 offset: -0.28144 drift: -587.880165289 ppm
04 Mar 15:04:17 offset: -0.31684 drift: -588.301324503 ppm
04 Mar 15:05:17 offset: -0.352247 drift: -588.602209945 ppm
04 Mar 15:06:17 offset: -0.387649 drift: -588.805687204 ppm
04 Mar 15:07:17 offset: -0.423046 drift: -588.94813278 ppm
04 Mar 15:08:17 offset: -0.458451 drift: -589.073800738 ppm
04 Mar 15:09:18 offset: -0.493856 drift: -588.1973466 ppm
04 Mar 15:10:18 offset: -0.529265 drift: -588.374057315 ppm
04 Mar 15:11:18 offset: -0.564661 drift: -588.503457815 ppm
04 Mar 15:12:18 offset: -0.600063 drift: -588.620689655 ppm
04 Mar 15:13:18 offset: -0.635458 drift: -588.712930012 ppm
04 Mar 15:14:18 offset: -0.040699 drift: 109.052048726 ppm
04 Mar 15:15:18 offset: -0.076098 drift: 65.4984423676 ppm
04 Mar 15:16:18 offset: -0.111495 drift: 27.0557184751 ppm
04 Mar 15:17:18 offset: -0.146885 drift: -7.12096029548 ppm
04 Mar 15:18:19 offset: -0.182285 drift: -37.6853146853 ppm
04 Mar 15:19:19 offset: -0.217688 drift: -65.2117940199 ppm
04 Mar 15:20:19 offset: -0.253085 drift: -90.1202531646 ppm
04 Mar 15:21:19 offset: -0.288479 drift: -112.768882175 ppm
04 Mar 15:22:19 offset: -0.323866 drift: -133.448699422 ppm
04 Mar 15:23:19 offset: -0.359259 drift: -152.414127424 ppm
04 Mar 15:24:20 offset: -0.394648 drift: -169.750830565 ppm
04 Mar 15:25:20 offset: -0.430047 drift: -185.861980831 ppm
04 Mar 15:26:20 offset: -0.46544 drift: -200.779692308 ppm
04 Mar 15:27:20 offset: -0.500835 drift: -214.63620178 ppm
04 Mar 15:28:20 offset: -0.536221 drift: -227.534670487 ppm
04 Mar 15:29:20 offset: -0.571605 drift: -239.574515235 ppm
04 Mar 15:30:21 offset: -0.606992 drift: -250.706859593 ppm
04 Mar 15:31:21 offset: -0.64241 drift: -261.286085151 ppm
04 Mar 15:32:21 offset: -0.677792 drift: -271.20795569 ppm
04 Mar 15:33:21 offset: -0.713187 drift: -280.554252199 ppm
04 Mar 15:34:21 offset: -0.040744 drift: 46.7374169041 ppm
04 Mar 15:35:21 offset: -0.076145 drift: 29.0987996307 ppm
04 Mar 15:36:21 offset: -0.111551 drift: 12.4088050314 ppm
04 Mar 15:37:21 offset: -0.146952 drift: -3.40288713911 ppm

And from the working 2.6.27.19 kernel:

jk@quad12:~$ python drift-test.py 10.192.96.19
04 Mar 16:17:23 offset: -0.006929 drift: -62.0 ppm
04 Mar 16:18:24 offset: -0.010252 drift: -54.5967741935 ppm
04 Mar 16:19:24 offset: -0.013574 drift: -54.9754098361 ppm
04 Mar 16:20:24 offset: -0.016897 drift: -55.1098901099 ppm
04 Mar 16:21:24 offset: -0.020233 drift: -55.2314049587 ppm
04 Mar 16:22:24 offset: -0.023566 drift: -55.2947019868 ppm
04 Mar 16:23:24 offset: -0.026895 drift: -55.3259668508 ppm
04 Mar 16:24:24 offset: -0.030217 drift: -55.3317535545 ppm
04 Mar 16:25:24 offset: -0.033539 drift: -55.3360995851 ppm
04 Mar 16:26:24 offset: -0.036865 drift: -55.3468634686 ppm
04 Mar 16:27:25 offset: -0.038266 drift: -52.0713101161 ppm
04 Mar 16:28:25 offset: -0.039747 drift: -49.592760181 ppm
04 Mar 16:29:25 offset: -0.041331 drift: -47.6680497925 ppm

Jesper Krogh

Mar 4, 2009, 1:37:25 PM3/4/09

to john stultz, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

Jesper Krogh wrote:
> john stultz wrote:
>> I guess something to test my idea above (that the drift is bad enough
>> that NTPd isn't making slew adjustments via adjtimex offset) is to
>> remove NTPd from the init.d startup.
>>
>> Then after rebooting (into 2.6.29), run the attached python script for
>> 10 minutes or so to get an idea of the ppm drift. Then repeat with
>> 2.6.26.
>>
>> To run: ./drift-test.py <ntp server>
>>
>> It will give some wild ppm numbers, but after a few minutes it should
>> settle down to the "natural drift" of the system.
>
> Ok. I removed ntpd from the system... heres is from "non-working

Updated. I think I had NTPd running in the former "non-working" test. I
just tried to reproduce the numbers, and they look like this
(reproducible on 2.6.29-rc6).

jk@quad12:~$ python drift-test.py 10.192.96.19

04 Mar 19:27:10 offset: -0.157696 drift: -693.0 ppm
04 Mar 19:28:10 offset: -0.195134 drift: -625.098360656 ppm
04 Mar 19:29:10 offset: -0.232579 drift: -624.595041322 ppm
04 Mar 19:30:10 offset: -0.270021 drift: -624.408839779 ppm
04 Mar 19:31:11 offset: -0.307461 drift: -621.727272727 ppm
04 Mar 19:32:11 offset: -0.344903 drift: -622.185430464 ppm
04 Mar 19:33:11 offset: -0.382345 drift: -622.491712707 ppm
04 Mar 19:34:11 offset: -0.419794 drift: -622.727488152 ppm
04 Mar 19:35:11 offset: -0.457239 drift: -622.89626556 ppm

Still the same.
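
For readers who want to check the figures, here is a minimal sketch of how a drift value in ppm can be derived from two offset samples. It is not the attached drift-test.py (which is not reproduced in this thread); the `drift_ppm` helper is hypothetical and simply divides the change in offset by the elapsed time, using two lines from the run above.

```python
from datetime import datetime


def drift_ppm(t0, offset0_s, t1, offset1_s):
    """Change in clock offset per elapsed second, expressed in parts per million."""
    elapsed_s = (t1 - t0).total_seconds()
    return (offset1_s - offset0_s) / elapsed_s * 1e6


fmt = "%H:%M:%S"
t0 = datetime.strptime("19:28:10", fmt)  # offset: -0.195134
t1 = datetime.strptime("19:35:11", fmt)  # offset: -0.457239
print(round(drift_ppm(t0, -0.195134, t1, -0.457239), 1))  # -622.6
```

The result agrees with the roughly -622 ppm that the script itself settles on.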

John Stultz

Mar 4, 2009, 1:59:48 PM3/4/09

to Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

On Wed, 2009-03-04 at 19:36 +0100, Jesper Krogh wrote:
> Jesper Krogh wrote:
> > john stultz wrote:
> >> I guess something to test my idea above (that the drift is bad enough
> >> that NTPd isn't making slew adjustments via adjtimex offset) is to
> >> remove NTPd from the init.d startup.
> >>
> >> Then after rebooting (into 2.6.29), run the attached python script for
> >> 10 minutes or so to get an idea of the ppm drift. Then repeat with
> >> 2.6.26.
> >>
> >> To run: ./drift-test.py <ntp server>
> >>
> >> It will give some wild ppm numbers, but after a few minutes it should
> >> settle down to the "natural drift" of the system.
> >
> > Ok. I removed ntpd from the system... heres is from "non-working
>
> Updated. I think I has NTPd running in the former "non-working" test. I
> just tried to reproduce the numbers, and they look like this
> (reproducible on 2.6.29-rc6).

Yea, the last numbers did look odd :)

> jk@quad12:~$ python drift-test.py 10.192.96.19
> 04 Mar 19:27:10 offset: -0.157696 drift: -693.0 ppm
> 04 Mar 19:28:10 offset: -0.195134 drift: -625.098360656 ppm
> 04 Mar 19:29:10 offset: -0.232579 drift: -624.595041322 ppm
> 04 Mar 19:30:10 offset: -0.270021 drift: -624.408839779 ppm
> 04 Mar 19:31:11 offset: -0.307461 drift: -621.727272727 ppm
> 04 Mar 19:32:11 offset: -0.344903 drift: -622.185430464 ppm
> 04 Mar 19:33:11 offset: -0.382345 drift: -622.491712707 ppm
> 04 Mar 19:34:11 offset: -0.419794 drift: -622.727488152 ppm
> 04 Mar 19:35:11 offset: -0.457239 drift: -622.89626556 ppm

Yea, so from this and the settled ntpdc -c kerninfo data before, we can
see that the drift is further out than the 500ppm NTP can handle.

So with that at least confirmed, we can focus back on to the fast-pit
tsc calibration code.
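
As a back-of-the-envelope check of what that means in practice, the sketch below estimates how long it takes for the leftover drift to build up to the 0.125 s slew boundary once the full 500 ppm correction is already applied. This is a simplification that ignores NTPd's own filtering and timers, and `seconds_until_step` is a hypothetical helper.

```python
def seconds_until_step(drift_ppm, max_corr_ppm=500.0, slew_boundary_s=0.125):
    """Time for the uncorrectable part of the drift to reach the slew boundary.

    Returns None when the drift is within the correctable range.
    """
    residual_ppm = abs(drift_ppm) - max_corr_ppm
    if residual_ppm <= 0:
        return None
    return slew_boundary_s / (residual_ppm * 1e-6)


t = seconds_until_step(-622.9)       # drift measured on 2.6.29-rc6 above
print(round(t / 60.0, 1), "minutes") # 17.0 minutes
print(seconds_until_step(-55.3))     # None: the 2.6.27.19 drift is correctable
```

Under these assumptions the estimate appears to be of the same order as the spacing between the "time reset" lines in the logs earlier in the thread.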

Ingo, Thomas: I'm missing a bit of the context to that patch. Other than
just speeding up boot times, was there another rationale for moving away
from the ACPI PM timer based calibration?

Could we maybe add a quick test that the pit reads actually take the
assumed 2us max? Doing this maybe via the HPET/ACPI PM?

thanks
-john

john stultz

Mar 4, 2009, 9:39:38 PM3/4/09

to Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

On Wed, 2009-03-04 at 10:57 -0800, John Stultz wrote:
> On Wed, 2009-03-04 at 19:36 +0100, Jesper Krogh wrote:
> > jk@quad12:~$ python drift-test.py 10.192.96.19
> > 04 Mar 19:27:10 offset: -0.157696 drift: -693.0 ppm
> > 04 Mar 19:28:10 offset: -0.195134 drift: -625.098360656 ppm
> > 04 Mar 19:29:10 offset: -0.232579 drift: -624.595041322 ppm
> > 04 Mar 19:30:10 offset: -0.270021 drift: -624.408839779 ppm
> > 04 Mar 19:31:11 offset: -0.307461 drift: -621.727272727 ppm
> > 04 Mar 19:32:11 offset: -0.344903 drift: -622.185430464 ppm
> > 04 Mar 19:33:11 offset: -0.382345 drift: -622.491712707 ppm
> > 04 Mar 19:34:11 offset: -0.419794 drift: -622.727488152 ppm
> > 04 Mar 19:35:11 offset: -0.457239 drift: -622.89626556 ppm
>
>
> Yea, so from this and the settled ntpdc -c kerninfo data before, we can
> see that the drift is further out then the 500ppm NTP can handle.
>
> So with that at least confirmed, we can focus back on to the fast-pit
> tsc calibration code.
>
> Ingo, Thomas: I'm missing a bit of the context to that patch, other then
> just speeding up boot times, was there other rational for moving away
> from the ACPI PM timer based calibration?
>
> Could we maybe add a quick test that the pit reads actually take the
> assumed 2us max? Doing this maybe via the HPET/ACPI PM?

Hey Jesper,

Here's a very-hackish patch to see if the approach I'm considering
might fix the issue you're hitting. Could you apply it, boot the kernel
a few times and send me the following segments of the dmesg for each of
those boots (the example below is from my test box)?

tsc delta: 44418024
ref_freq: 3000100 pit_freq: 3000384
TSC: Fast PIT calibration matches PMTIMER.
TSC: PIT calibration matches PMTIMER. 1 loops
Detected 3000.045 MHz processor.

I'm trying to see how regular the mis-calculation is, as well as see how
well the alternate calibration method does to handle this on your
hardware.

It's likely the fast-pit calibration can be better integrated with the
other calibration methods, so this probably isn't anything close to what
the actual fix will look like.

Ingo, Thomas: On the hardware I'm testing the fast-pit calibration only
triggers probably 80-90% of the time. About 10-20% of the time, the
initial check to pit_expect_msb(0xff) fails (count=0), so we may need to
look more at this approach.

john stultz

Mar 4, 2009, 9:52:30 PM3/4/09

to Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

Err. Sorry, hit send before I included the patch.

-john

Not for inclusion.

Signed-off-by: John Stultz <john...@us.ibm.com>

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 599e581..2e16d30 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -317,15 +317,17 @@ static unsigned long quick_pit_calibrate(void)

 	if (pit_expect_msb(0xff)) {
 		int i;
-		u64 t1, t2, delta;
+		u64 t1, t2, delta, ref1, ref2;
+		u64 ref_freq = 0, pit_freq = 0;
+		int hpet = is_hpet_enabled();
 		unsigned char expect = 0xfe;

-		t1 = get_cycles();
+		t1 = tsc_read_refs(&ref1, hpet);
 		for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
 			if (!pit_expect_msb(expect))
 				goto failed;
 		}
-		t2 = get_cycles();
+		t2 = tsc_read_refs(&ref2, hpet);

 		/*
 		 * Make sure we can rely on the second TSC timestamp:
@@ -333,6 +335,13 @@ static unsigned long quick_pit_calibrate(void)
 		if (!pit_expect_msb(expect))
 			goto failed;

+
+		delta = (t2 - t1);
+		if (hpet)
+			ref_freq = calc_hpet_ref(delta*1000000LL, ref1, ref2);
+		else
+			ref_freq = calc_pmtimer_ref(delta*1000000LL, ref1, ref2);
+
 		/*
 		 * Ok, if we get here, then we've seen the
 		 * MSB of the PIT decrement QUICK_PIT_ITERATIONS
@@ -347,10 +356,32 @@ static unsigned long quick_pit_calibrate(void)
 		 * kHz = (t2 - t1) / (QPI * 256 / PIT_TICK_RATE) / 1000
 		 * kHz = ((t2 - t1) * PIT_TICK_RATE) / (QPI * 256 * 1000)
 		 */
-		delta = (t2 - t1)*PIT_TICK_RATE;
-		do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
+		printk("tsc delta: %lld\n", t2-t1);
+
+		pit_freq = delta * PIT_TICK_RATE;
+		do_div(pit_freq, QUICK_PIT_ITERATIONS*256*1000);
+
+		printk("ref_freq: %lld pit_freq: %lld\n", ref_freq, pit_freq);
+
+		/* Check the reference deviation */
+		delta = ((u64) pit_freq) * 100;
+		do_div(delta, ref_freq);
+
+		/*
+		 * If both calibration results are inside a 10% window
+		 * then we can be sure, that the calibration
+		 * succeeded. We break out of the loop right away. We
+		 * use the reference value, as it is more precise.
+		 */
+		if (delta >= 90 && delta <= 110) {
+			printk(KERN_INFO
+			       "TSC: Fast PIT calibration matches %s.\n",
+			       hpet ? "HPET" : "PMTIMER");
+			return ref_freq;
+		}
+
 		printk("Fast TSC calibration using PIT\n");
-		return delta;
+		return pit_freq;
 	}
 failed:
 	return 0;
@@ -375,7 +406,7 @@ unsigned long native_calibrate_tsc(void)
 	local_irq_save(flags);
 	fast_calibrate = quick_pit_calibrate();
 	local_irq_restore(flags);
-	if (fast_calibrate)
+	if (0 && fast_calibrate)
 		return fast_calibrate;

 	/*

Ingo Molnar

Mar 5, 2009, 3:44:24 AM3/5/09

to john stultz, Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

* john stultz <john...@us.ibm.com> wrote:

> > Ingo, Thomas: On the hardware I'm testing the fast-pit
> > calibration only triggers probably 80-90% of the time. About
> > 10-20% of the time, the initial check to
> > pit_expect_msb(0xff) fails (count=0), so we may need to look
> > more at this approach.

We definitely need to improve calibration quality.

The question is - why does fast-calibration fail 10-20% of the
time on your test-system? Also, why exactly do we miscalibrate?
Could you please have a look at that?

One theory would be that the PIT readout is unreliable. Windows
does not make use of it, so it's not the most tested aspect of
the PIT. Is that what happens on your box?

Ingo

john stultz

Mar 5, 2009, 10:14:03 PM3/5/09

to Ingo Molnar, Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

On Thu, 2009-03-05 at 09:43 +0100, Ingo Molnar wrote:
> * john stultz <john...@us.ibm.com> wrote:
>
> > > Ingo, Thomas: On the hardware I'm testing the fast-pit
> > > calibration only triggers probably 80-90% of the time. About
> > > 10-20% of the time, the initial check to
> > > pit_expect_msb(0xff) fails (count=0), so we may need to look
> > > more at this approach.
>
> We definitely need to improve calibration quality.
>
> The question is - why does fast-calibration fail 10-20% of the
> time on your test-system? Also, why exactly do we miscalibrate?
> Could you please have a look at that?

Working on it, I just wanted to let you know I was seeing some different
odd behavior than Jesper.

> One theory would be that the PIT readout is unreliable. Windows
> does not make use of it, so it's not the most tested aspect of
> the PIT. Is that what happens on your box?

Still looking into it, but from my initial debugging it seems that by
reading the PIT very quickly after setting it, we may be getting junk
values. If I re-read the PIT again, I see the expected 0xff value.

It's been somewhat of a heisenbug, as if I add any printk's or even just
a mb() after the outb it seems to make the problem go away (or just rare
enough I don't have the patience to reproduce it :)

So I don't know if a small delay is appropriate here (seems counter
productive to the whole fast-pit calibration ;) or if we should just try
to catch these bad reads and try again before failing?

Thoughts?

thanks
-john

john stultz

Mar 5, 2009, 10:54:23 PM3/5/09

to Ingo Molnar, Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

Maybe something like the following? (Not tested heavily yet!)

Again, just for clarity, as we've mixed a few issues here, this patch is
for a side issue and not related to the original regression reported by
Jesper. I'm still waiting on debug output from Jesper to further
diagnose what's going wrong with his TSC calibration.

thanks
-john

Apparently some hardware may occasionally return junk values if you try
to read the pit immediately after setting it. This causes the
pit_expect_msb() to occasionally fail (~10% of the time).

This patch tries to work around this issue by not failing if the first
read right after setting the PIT is not what we expect.

NOT FOR INCLUSION (yet!)

Signed-off-by: John Stultz <john...@us.ibm.com>

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 599e581..2ca5ba4 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -280,8 +280,17 @@ static inline int pit_expect_msb(unsigned char val)
 	for (count = 0; count < 50000; count++) {
 		/* Ignore LSB */
 		inb(0x42);
-		if (inb(0x42) != val)
+		if (inb(0x42) != val) {
+			/*
+			 * If we're too fast, we may read
+			 * junk values right after we set
+			 * the PIT. So if this is the first
+			 * read, try again
+			 */
+			if (val == 0xff && count == 0)
+				continue;
 			break;
+		}
 	}
 	return count > 50;

Ingo Molnar

Mar 6, 2009, 6:35:36 AM3/6/09

to john stultz, Jesper Krogh, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

We could do something like that if it helps the end result. But
this special thing inside the loop should just be an
unconditional inb(0x42) outside the loop. It does not hurt
performance there, and we'll get simpler code that way.

But ... I really don't like how we rely on PIT readouts and how
we work around PIT readout artifacts. Only Linux does PIT
readouts while Windows does not - so we rely on an under-tested
aspect of PC hardware.

I think we should think about a fundamentally different, IRQ
driven way of calibration. For example we could program a 27
milliseconds PIT periodic interrupt with the maximum count and
measure its arrival timestamp in two subsequent interrupts.

We could do that with about 1-2 usecs precision realistically
(this early during bootup we are really quiescent) - and over a
27,000 usecs period that gives us an accuracy of 1:13500, or
about 75 ppm. That's still only about 50 milliseconds spent
calibrating, so very fast.
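
The accuracy figure above is simple arithmetic and can be checked with a short sketch. It assumes the worst case of the quoted 1-2 usec timestamp precision; `calibration_error_ppm` is a hypothetical helper.

```python
def calibration_error_ppm(timestamp_error_us, period_us):
    """Relative error of a period measurement, in parts per million."""
    return timestamp_error_us / period_us * 1e6


print(27000 // 2)                              # 13500, i.e. an accuracy of 1:13500
print(round(calibration_error_ppm(2, 27000)))  # 74, i.e. "about 75 ppm"
```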

We can re-write the IRQ#0 vector with a special temporary
calibration interrupt handler to make this really single-purpose
and precise.

Hm?

Ingo

Jesper Krogh

Mar 9, 2009, 4:43:00 PM

to john stultz, Thomas Gleixner, Linus Torvalds, Linux Kernel Mailing List, Len Brown

Hi John.

Patched into 2.6.28.7 ..

First boot.
[ 0.000000] tsc delta: 34203220
[ 0.000000] ref_freq: 2311825 pit_freq: 2310386
[ 0.000000] TSC: Fast PIT calibration matches PMTIMER.
[ 0.000000] TSC: PIT calibration matches PMTIMER. 2 loops
[ 0.000000] Detected 2311.877 MHz processor.
Second boot:
[ 0.000000] tsc delta: 34200313
[ 0.000000] ref_freq: 2311803 pit_freq: 2310190
[ 0.000000] TSC: Fast PIT calibration matches PMTIMER.
[ 0.000000] TSC: PIT calibration matches PMTIMER. 2 loops
[ 0.000000] Detected 2311.876 MHz processor.
Third boot:
[ 0.000000] tsc delta: 34198686
[ 0.000000] ref_freq: 2311824 pit_freq: 2310080
[ 0.000000] TSC: Fast PIT calibration matches PMTIMER.
[ 0.000000] TSC: PIT calibration matches PMTIMER. 1 loops
[ 0.000000] Detected 2311.872 MHz processor.
Fourth boot:
[ 0.000000] tsc delta: 34199433
[ 0.000000] ref_freq: 2311831 pit_freq: 2310130
[ 0.000000] TSC: Fast PIT calibration matches PMTIMER.
[ 0.000000] TSC: PIT calibration matches PMTIMER. 2 loops
[ 0.000000] Detected 2311.821 MHz processor.

--
Jesper

Linus Torvalds

Mar 10, 2009, 12:28:06 AM

to Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

On Mon, 9 Mar 2009, Jesper Krogh wrote:
>
> First boot.

> [ 0.000000] ref_freq: 2311825 pit_freq: 2310386

> Second boot:

> [ 0.000000] ref_freq: 2311803 pit_freq: 2310190

> Third boot:

> [ 0.000000] ref_freq: 2311824 pit_freq: 2310080

> Fourth boot:

> [ 0.000000] ref_freq: 2311831 pit_freq: 2310130

It's really quite impressively stable, but the fast-PIT calibration
frequency is reliably about 3/4 of a promille low. Or, put another way,
the TSC difference over the pit calibration is just a _tad_ too small
compared to the value we'd expect if that loop of pit_expect_msb() would
really run at the expected delay of a 1.193182MHz clock divided by 256.

And it's stable in that it really always seems to be off by a very similar
amount. It's not moving around very much.
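
To put a number on that shortfall, here is a minimal sketch (not kernel code; `expected_tsc_delta` is a made-up helper). It assumes the fast path spans 69 MSB steps of 256 PIT ticks each, which is what a 15ms window works out to, and that ref_freq is reported in kHz. It computes how many TSC cycles the loop would be expected to take if the CPU ran at the PM-timer frequency:

```c
#include <stdint.h>

#define PIT_TICK_RATE 1193182ULL /* PIT input clock in Hz */

/*
 * TSC cycles we would expect to elapse while the PIT MSB decrements
 * `iterations` times (256 PIT ticks per decrement), if the CPU really
 * ran at `ref_khz`. Hypothetical helper, for illustration only.
 */
static uint64_t expected_tsc_delta(uint64_t ref_khz, uint64_t iterations)
{
	return ref_khz * iterations * 256 * 1000 / PIT_TICK_RATE;
}
```

For the first boot, expected_tsc_delta(2311825, 69) comes to roughly 34.22 million cycles against the reported "tsc delta: 34203220", a shortfall of about 21000 cycles, or around 9us at 2.3GHz.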

I also wonder why it seems to happen mainly just to _you_. There's
absolutely nothing odd in your system, neither a slow CPU nor anything
else that would stand out.

Grr. Very annoyingly non-obvious.

Linus

Thomas Gleixner

Mar 10, 2009, 7:31:00 AM

to Linus Torvalds, Jesper Krogh, john stultz, Linux Kernel Mailing List, Len Brown

On Mon, 9 Mar 2009, Linus Torvalds wrote:
> On Mon, 9 Mar 2009, Jesper Krogh wrote:
> >
> > First boot.
> > [ 0.000000] ref_freq: 2311825 pit_freq: 2310386
> > Second boot:
> > [ 0.000000] ref_freq: 2311803 pit_freq: 2310190
> > Third boot:
> > [ 0.000000] ref_freq: 2311824 pit_freq: 2310080
> > Fourth boot:
> > [ 0.000000] ref_freq: 2311831 pit_freq: 2310130
>
> It's really quite impressively stable, but the fast-PIT calibration
> frequency is reliably about 3/4 of a promille low. Or, put another way,
> the TSC difference over the pit calibration is just a _tad_ too small
> compared to the value we'd expect if that loop of pit_expect_msb() would
> really run at the expected delay of a 1.193182MHz clock divided by 256.
>
> And it's stable in that it really always seems to be off by a very similar
> amount. It's not moving around very much.
>
> I also wonder why it seems to happen mainly just to _you_. There's
> absolutely nothing odd in your system, neither a slow CPU or anything
> else that would stand out.
>
> Grr. Very annoyingly non-obvious.

Indeed. One hint is in the slow calibration path. 3 of 4 boots have:

> > [ 0.000000] TSC: PIT calibration matches PMTIMER. 2 loops

So the slow calibration path detects some disturbance.

Jesper, can you please apply the following patch instead of John's and
provide the output for a couple of boots? The output is:

Fast TSC calibration using PIT

tsc 43425305 tscmin 624008 tscmax 632610

Thanks,

tglx

--- linux-2.6.orig/arch/x86/kernel/tsc.c
+++ linux-2.6/arch/x86/kernel/tsc.c
@@ -317,15 +317,22 @@ static unsigned long quick_pit_calibrate

 	if (pit_expect_msb(0xff)) {
 		int i;
-		u64 t1, t2, delta;
+		u64 t1, t2, t3, delta;
 		unsigned char expect = 0xfe;
+		unsigned long tscmin = ULONG_MAX, tscmax = 0;

-		t1 = get_cycles();
+		t1 = t2 = get_cycles();
 		for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
 			if (!pit_expect_msb(expect))
 				goto failed;
+			t3 = get_cycles();
+			delta = t3 - t2;
+			t2 = t3;
+			if ((unsigned long) delta < tscmin)
+				tscmin = (unsigned int) delta;
+			if ((unsigned long) delta > tscmax)
+				tscmax = (unsigned int) delta;
 		}
-		t2 = get_cycles();

 		/*
 		 * Make sure we can rely on the second TSC timestamp:
@@ -350,6 +357,8 @@ static unsigned long quick_pit_calibrate
 		delta = (t2 - t1)*PIT_TICK_RATE;
 		do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
 		printk("Fast TSC calibration using PIT\n");
+		printk("tsc %ld tscmin %ld tscmax %ld\n",
+		       (unsigned long) (t2 - t1), tscmin, tscmax);
 		return delta;
 	}
 failed:

Jesper Krogh

Mar 10, 2009, 3:43:21 PM

to Thomas Gleixner, Linus Torvalds, john stultz, Linux Kernel Mailing List, Len Brown

Organization: Internet mailing list

First boot:
[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] tsc 34202223 tscmin 474069 tscmax 500664
Second boot:
Here I didn't get the above messages.. http://krogh.cc/~jesper/dmesg-boot2.txt
Third boot:
[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] tsc 34199856 tscmin 470321 tscmax 502182
Fourth boot:
[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] tsc 34202008 tscmin 475510 tscmax 501501

The second one is really strange.. isn't it?

While booting up I saw this one on the serial console..
root@quad12:~# hwclock --systohc
Cannot access the Hardware Clock via any known method.
Use the --debug option to see the details of our search for an access
method.
root@quad12:~# hwclock --systohc --debug
hwclock from util-linux-ng 2.13.1
hwclock: Open of /dev/rtc failed, errno=2: No such file or directory.
No usable clock interface found.
Cannot access the Hardware Clock via any known method.

Jesper
--
Jesper

Thomas Gleixner

Mar 10, 2009, 6:23:23 PM

to Jesper Krogh, Linus Torvalds, john stultz, Linux Kernel Mailing List, Len Brown, Ingo Molnar

Jesper,

On Tue, 10 Mar 2009, Jesper Krogh wrote:
> First boot:
> [ 0.000000] Fast TSC calibration using PIT
> [ 0.000000] tsc 34202223 tscmin 474069 tscmax 500664
> Second boot:
> Here I didn't get the above messages.. http://krogh.cc/~jesper/dmesg-boot2.txt
> Third boot:
> [ 0.000000] Fast TSC calibration using PIT
> [ 0.000000] tsc 34199856 tscmin 470321 tscmax 502182
> Fourth boot:
> [ 0.000000] Fast TSC calibration using PIT
> [ 0.000000] tsc 34202008 tscmin 475510 tscmax 501501
>
> The second one is really strange.. isn't it?

No, the fast PIT calibration simply failed there and it dropped into
the slow path:
[ 0.000000] TSC: PIT calibration matches PMTIMER. 1 loops
[ 0.000000] Detected 2311.878 MHz processor.

But the variance of the third run is interesting:

avg = tsc / loops = 495650
avg - tscmin = 25329 (~ 10.9 us)
tscmax - avg = 6532 (~ 2.8 us)

While this is in the range which the PIT calibration code accepts, the
resulting CPU frequency of this run is 2310.159 MHz, which is way off
the result of the slow path in the 2nd run. The 1st and the 4th run
have significantly high variance as well.

I ran the same patch on a couple of test machines and all have
deviations from avg in the range of +/- 2 us and the calibration
result is stable and correct.

I have no idea what might cause the problem with your machine. PIT via
SMM emulation comes to mind :)

But we can use the tscmin/max method to figure out whether the fast
PIT result is reliable. See patch below. It should drop out into the
slow calibration path on every boot on your machine.

(tscmax - tscmin) / avg = 0.064 (result from third run)

On my test machines I get values below 0.02

While it's statistically not really correct we still can use that info
to catch cases like we see on your machines.
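
The arithmetic above can be reproduced with a small sketch. This is illustration only, not the patch itself: the struct and function names are made up, and the loop count of 69 is an assumption derived from a 15ms window at 256 PIT ticks per step.

```c
#include <stdint.h>

struct loop_spread {
	uint64_t avg;      /* average TSC cycles per MSB step */
	uint64_t below;    /* avg - tscmin */
	uint64_t above;    /* tscmax - avg */
	uint64_t ratio_e4; /* (tscmax - tscmin) / avg, scaled by 10000 */
};

/* Hypothetical helper: summarize the per-step spread of one boot. */
static struct loop_spread pit_loop_spread(uint64_t tsc, uint64_t loops,
					  uint64_t tscmin, uint64_t tscmax)
{
	struct loop_spread s;

	s.avg = tsc / loops;
	s.below = s.avg - tscmin;
	s.above = tscmax - s.avg;
	s.ratio_e4 = (tscmax - tscmin) * 10000 / s.avg;
	return s;
}
```

Fed the third run (tsc 34199856, tscmin 470321, tscmax 502182, 69 loops), this gives avg 495650 and a scaled ratio of 642, i.e. about 6.4%, well above a 2.5% (250) cutoff.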

> While booting up I saw this one on the serial console..
> root@quad12:~# hwclock --systohc
> Cannot access the Hardware Clock via any known method.
> Use the --debug option to see the details of our search for an access method.
> root@quad12:~# hwclock --systohc --debug
> hwclock from util-linux-ng 2.13.1
> hwclock: Open of /dev/rtc failed, errno=2: No such file or directory.
> No usable clock interface found.
> Cannot access the Hardware Clock via any known method.

Can you provide your .config file please ?

Thanks,

tglx

--------->

Subject: x86: make TSC fast calibration more robust
From: Thomas Gleixner <tg...@linutronix.de>
Date: Tue, 10 Mar 2009 11:12:03 +0100

Check the min/max duration of each PIT loop against the resulting
average value and dismiss the fast calibration if the spread is larger
than 2.5%. 2.5% is in the range of +/- 2us, which is a reasonable range
when we assume that a PIT read can easily take 1 us.

Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
arch/x86/kernel/tsc.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/kernel/tsc.c
===================================================================

--- linux-2.6.orig/arch/x86/kernel/tsc.c
+++ linux-2.6/arch/x86/kernel/tsc.c
@@ -317,15 +317,22 @@ static unsigned long quick_pit_calibrate

 	if (pit_expect_msb(0xff)) {
 		int i;
-		u64 t1, t2, delta;
+		u64 t1, t2, t3, delta;
 		unsigned char expect = 0xfe;
+		unsigned long tscmin = ULONG_MAX, tscmax = 0;

-		t1 = get_cycles();
+		t1 = t2 = get_cycles();
 		for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
 			if (!pit_expect_msb(expect))
 				goto failed;
+			t3 = get_cycles();
+			delta = t3 - t2;
+			t2 = t3;
+			if ((unsigned long) delta < tscmin)
+				tscmin = (unsigned long) delta;
+			if ((unsigned long) delta > tscmax)
+				tscmax = (unsigned long) delta;
 		}
-		t2 = get_cycles();

 		/*
 		 * Make sure we can rely on the second TSC timestamp:
@@ -334,6 +341,23 @@ static unsigned long quick_pit_calibrate
 			goto failed;

 		/*
+		 * Sanity check the min max values:
+		 *
+		 * We calculate the average tsc increment per loop
+		 * step. Now we take the tscmin and tscmax value and
+		 * check whether the deviation is inside an acceptable
+		 * range.
+		 */
+		delta = (t2 - t1);
+		do_div(delta, QUICK_PIT_ITERATIONS);
+		t3 = (unsigned long) delta;
+		delta = tscmax - tscmin;
+		delta *= 10000;
+		do_div(delta, t3);
+		/* Fail if the deviation is > 2.5 % */
+		if (delta > 250)
+			goto failed;
+		/*
 		 * Ok, if we get here, then we've seen the
 		 * MSB of the PIT decrement QUICK_PIT_ITERATIONS
 		 * times, and each MSB had many hits, so we never

Linus Torvalds

Mar 14, 2009, 9:23:04 PM

to Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

Jesper, here's a patch that actually tries to take the TSC error really
into account, and which I suspect will result (on your machine) in failing
the fast PIT calibration.

It also has a few extra printk's for debugging, and to see just what the
values are on your machine.

The idea behind the patch is to just keep track of how big the difference
was in TSC values between two successive reads of the PIT timer. We only
really care about the difference when the MSB turns around, and we only
really care about the two end points. The maximum error in TSC estimation
will simply be the sum of the differences at those points (d1 and d2).

We can then compare the maximum error with the actual TSC differences
between those points, and see if the max error is within 500 ppm. That
_should_ mean that it all works - assuming that the PIT itself is running
at the correct frequency, of course!

Regardless of whether it succeeds or not, it will print out some debug
messages, which will be interesting to see.

What's nice about this is that it really should make that whole "yes, it's
really within 500ppm" assertion have some solid legs to stand on. Rather
than depend on us being able to read the PIT a certain number of times, we
can literally give an estimation of the max error.
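
As a rough user-space model of that idea (a sketch under stated assumptions, not the kernel code): the PIT and TSC are replaced by a simulated clock in which every LSB+MSB read pair costs a fixed number of cycles and the MSB decrements at a fixed period. The `sim_*` names and all the numbers fed to it are hypothetical.

```c
#include <stdint.h>

struct sim_pit {
	uint64_t now;        /* simulated TSC, in cycles */
	uint64_t read_cost;  /* cycles consumed by one LSB+MSB read pair */
	uint64_t msb_period; /* cycles between MSB decrements */
};

/* One simulated read pair: time advances, then the MSB is sampled. */
static unsigned char sim_read_msb(struct sim_pit *p)
{
	p->now += p->read_cost;
	return (unsigned char)(0xff - p->now / p->msb_period);
}

/* Same shape as the patched pit_expect_msb(), on the simulated clock. */
static int sim_expect_msb(struct sim_pit *p, unsigned char val,
			  uint64_t *tscp, uint64_t *deltap)
{
	int count;
	uint64_t tsc = 0;

	for (count = 0; count < 50000; count++) {
		if (sim_read_msb(p) != val)
			break;
		tsc = p->now; /* timestamp of the last read that still matched */
	}
	*deltap = p->now - tsc; /* how late we may have noticed the change */
	*tscp = tsc;
	return count > 5;
}

/* Walk `iterations` MSB steps; report the TSC delta and the d1+d2 error. */
static int sim_calibrate(struct sim_pit *p, int iterations,
			 uint64_t *delta, uint64_t *err)
{
	uint64_t t1, t2 = 0, d1, d2 = 0;
	int i;

	if (!sim_expect_msb(p, 0xff, &t1, &d1))
		return 0;
	for (i = 1; i <= iterations; i++) {
		if (!sim_expect_msb(p, (unsigned char)(0xff - i), &t2, &d2))
			return 0;
	}
	*delta = t2 - t1;
	*err = d1 + d2;
	return 1;
}
```

In this model a read cost of 3000 cycles with a 495000-cycle MSB period gives an error term of 6000 over a delta of 34155000 after 69 steps, comfortably below delta >> 11 (16677). Tripling the read cost to 9000 leaves the delta unchanged but pushes the error term to 18000, which would fail the check.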

Linus

---
arch/x86/kernel/tsc.c | 41 +++++++++++++++++++++++++++++------------
1 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 599e581..8e1db42 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -273,17 +273,26 @@ static unsigned long pit_calibrate_tsc(u32 latch, unsigned long ms, int loopmin)
  * use the TSC value at the transitions to calculate a pretty
  * good value for the TSC frequencty.
  */
-static inline int pit_expect_msb(unsigned char val)
+static unsigned long pit_expect_msb(unsigned char val, u64 *tscp, unsigned long *deltap)
 {
-	int count = 0;
+	int count;
+	u64 tsc = 0;

 	for (count = 0; count < 50000; count++) {
 		/* Ignore LSB */
 		inb(0x42);
 		if (inb(0x42) != val)
 			break;
+		tsc = get_cycles();
 	}
-	return count > 50;
+	*deltap = get_cycles() - tsc;
+	*tscp = tsc;
+
+	/*
+	 * We require _some_ success, but the quality control
+	 * will be based on the error terms on the TSC values.
+	 */
+	return count > 5;
 }

 /*
@@ -297,6 +306,10 @@ static inline int pit_expect_msb(unsigned char val)

 static unsigned long quick_pit_calibrate(void)
 {
+	u64 t1, t2;
+	unsigned long d1, d2;
+	unsigned char expect = 0xff;
+
 	/* Set the Gate high, disable speaker */
 	outb((inb(0x61) & ~0x02) | 0x01, 0x61);

@@ -315,22 +328,24 @@ static unsigned long quick_pit_calibrate(void)
 	outb(0xff, 0x42);
 	outb(0xff, 0x42);

-	if (pit_expect_msb(0xff)) {
+	if (pit_expect_msb(0xff, &t1, &d1)) {
 		int i;
-		u64 t1, t2, delta;
-		unsigned char expect = 0xfe;
+		u64 delta;

-		t1 = get_cycles();
+		expect--;
 		for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
-			if (!pit_expect_msb(expect))
+			if (!pit_expect_msb(expect, &t2, &d2))
 				goto failed;
 		}
-		t2 = get_cycles();

 		/*
-		 * Make sure we can rely on the second TSC timestamp:
+		 * We require the max error on the calibration to be
+		 * within 500 ppm, since that's the limit of ntpd
+		 * drift correction. So the TSC delta must be more
+		 * than 2000x the possible error term (d1+d2).
 		 */
-		if (!pit_expect_msb(expect))
+		delta = t2 - t1;
+		if (d1+d2 > delta >> 11)
 			goto failed;

 		/*
@@ -347,12 +362,14 @@ static unsigned long quick_pit_calibrate(void)
 		 * kHz = (t2 - t1) / (QPI * 256 / PIT_TICK_RATE) / 1000
 		 * kHz = ((t2 - t1) * PIT_TICK_RATE) / (QPI * 256 * 1000)
 		 */
-		delta = (t2 - t1)*PIT_TICK_RATE;
+		printk("Fast TSC delta=%lld, error=%lu+%lu=%lu\n", delta, d1, d2, d1+d2);
+		delta *= PIT_TICK_RATE;
 		do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
 		printk("Fast TSC calibration using PIT\n");
 		return delta;
 	}
 failed:
+	printk("Fast TSC calibration failed at %u %llu(%lu) %llu(%lu)\n", expect, t1, d1, t2, d2);
 	return 0;

Jesper Krogh

Mar 15, 2009, 11:44:59 AM

to Linus Torvalds, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

Linus Torvalds wrote:
>
> Jesper, here's a patch that actually tries to take the TSC error really
> into account, and which I suspect will result (on your machine) in failing
> the fast PIT calibration.
>
> It also has a few extra printk's for debugging, and to see just what the
> values are on your machine.
>
> The idea behind the patch is to just keep track of how big the difference
> was in TSC values between two successive reads of the PIT timer. We only
> really care about the difference when the MSB turns around, and we only
> really care about the two end points. The maximum error in TSC estimation
> will simply be the sum of the differences at those points (d1 and d2).
>
> We can then compare the maximum error with the actual TSC differences
> between those points, and see if the max error is within 500 ppm. That
> _should_ mean that it all works - assuming that the PIT itself is running
> at the correct frequency, of course!
>
> Regardless of whether it succeeds or not, it will print out some debug
> messages, which will be interesting to see.

[ 0.000000] Fast TSC delta=34227730, error=6223+6219=12442
[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] Detected 2312.045 MHz processor.

Using "ntpq -c peers" .. the offset steadily grows as time goes on.

Full dmesg: http://krogh.cc/~jesper/dmesg-linux-2.6.29-rc8-linus1.txt

jk@quad11:~$ ntpdc -c kerninfo
pll offset: 0.085167 s
pll frequency: -18.722 ppm
maximum error: 0.137231 s
estimated error: 0.008823 s
status: 0001 pll
pll time constant: 6

precision: 1e-06 s
frequency tolerance: 500 ppm

--
Jesper

Linus Torvalds

Mar 15, 2009, 2:13:38 PM

to Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

On Sun, 15 Mar 2009, Jesper Krogh wrote:
> Linus Torvalds wrote:
> >
> > Regardless of whether it succeeds or not, it will print out some debug
> > messages, which will be interesting to see.
>
>
> [ 0.000000] Fast TSC delta=34227730, error=6223+6219=12442
> [ 0.000000] Fast TSC calibration using PIT
> [ 0.000000] Detected 2312.045 MHz processor.

Ok. This claims that the error really is smaller than 500ppm (it's about
360 ppm). Which is about what we're aiming for (in real life, the actual
error is about half that - we're just adding up the error terms for
maximum theoretical error).
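
A minimal sketch of that check, using the numbers from the debug line above. The helper names are made up for illustration; only the `delta >> 11` comparison comes from the patch. Note that dividing by 2048 makes the bound about 488 ppm, slightly stricter than a round 500.

```c
#include <stdint.h>

/* Largest d1+d2 the check tolerates: delta/2048, i.e. about 488 ppm. */
static uint64_t max_allowed_error(uint64_t delta)
{
	return delta >> 11;
}

/* Mirrors the patch's test, which fails when d1+d2 > delta >> 11. */
static int fast_calibration_ok(uint64_t delta, uint64_t d1, uint64_t d2)
{
	return d1 + d2 <= max_allowed_error(delta);
}
```

For delta=34227730 the limit is 16712 cycles, so the reported 6223+6219=12442 passes with some room to spare.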

> Using "ntpq -c peers" .. the offset steadily grows as time goes.
>
> Full dmesg: http://krogh.cc/~jesper/dmesg-linux-2.6.29-rc8-linus1.txt
>
> jk@quad11:~$ ntpdc -c kerninfo
> pll offset: 0.085167 s
> pll frequency: -18.722 ppm
> maximum error: 0.137231 s
> estimated error: 0.008823 s
> status: 0001 pll
> pll time constant: 6
> precision: 1e-06 s
> frequency tolerance: 500 ppm

Hmm. But now it all seems to _work_, no? Or do you still get time resets?
Now your "pll frequency" and "estimated error" are real values, not just
"0s" like in your previous failure cases.

Of course, maybe that happens only after the time reset actually kicks in.

But one thing my patch did - apart from the error estimation - was to
synchronize the TSC read with the actual PIT MSB wrap event. Maybe that
mattered.

The other possibility (if the time reset actually happens) is that your
PIT is simply not running at the expected frequency. That would be really
quite odd, since that nominal 1193181.8181 Hz frequency is very standard,
and has been around forever.

I do not know how to test that. We need a reference timer to sync to, and
the PIT has traditionally been a _lot_ more reliable than the other timers
in the system (the PM timer may be reliable on modern machines, but almost
certainly not on anything a few years old).

Linus

Jesper Krogh

Mar 15, 2009, 2:39:08 PM

to Linus Torvalds, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

Linus Torvalds wrote:
>
> On Sun, 15 Mar 2009, Jesper Krogh wrote:
>> Linus Torvalds wrote:
>>> Regardless of whether it succeeds or not, it will print out some debug
>>> messages, which will be interesting to see.
>>
>> [ 0.000000] Fast TSC delta=34227730, error=6223+6219=12442
>> [ 0.000000] Fast TSC calibration using PIT
>> [ 0.000000] Detected 2312.045 MHz processor.
>
> Ok. This claims that the error really is smaller than 500ppm (it's about
> 360 ppm). Which is about what we're aiming for (in real life, the actual
> error is about half that - we're just adding up the error terms for
> maximum theoretical error).
>
>> Using "ntpq -c peers" .. the offset steadily grows as time goes.
>>
>> Full dmesg: http://krogh.cc/~jesper/dmesg-linux-2.6.29-rc8-linus1.txt
>>
>> jk@quad11:~$ ntpdc -c kerninfo
>> pll offset: 0.085167 s
>> pll frequency: -18.722 ppm
>> maximum error: 0.137231 s
>> estimated error: 0.008823 s
>> status: 0001 pll
>> pll time constant: 6
>> precision: 1e-06 s
>> frequency tolerance: 500 ppm
>
> Hmm. But now it all seems to _work_, no? Or do you still get time resets?

My conclusion was that I would get a time reset after some time since
the offset just increased as time went by (being reasonably small at the
beginning).

I had it up for around 30 minutes... Should I have tested longer?

I went on to trying Thomas Gleixner's patch (which seems to do exactly
the same ..); I'll write a reply to that message in a few minutes.

--
Jesper

Linus Torvalds

Mar 15, 2009, 3:05:48 PM

to Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

On Sun, 15 Mar 2009, Jesper Krogh wrote:
> > >
> > > [ 0.000000] Fast TSC delta=34227730, error=6223+6219=12442
> > > [ 0.000000] Fast TSC calibration using PIT
> > > [ 0.000000] Detected 2312.045 MHz processor.
>

> My conclusion was that I would get a time reset after some time since the
> offset just increased as time went by (being reasonably small at the
> beginning).
>
> I had it up for around 30 minutes... Should I have tested longer?

It would be good to test longer. Your previous emails showed:

2.6.26: time.c: Detected 2311.847 MHz processor.
2.6.29: Detected 2310.029 MHz processor.

where that first one was a successful boot, and the second one was a
failing one. So let's assume that 2311.847 is the "correct" frequency.

The difference between the correct one and your failing one is ~790 ppm,
which is above the 500ppm ntpd threshold. And as we saw earlier, those
differences were pretty consistent, ie in your list of four successive
boots, the old code consistently gave a frequency error that was roughly
0.7 permille off (ie roughly that 700 ppm).

HOWEVER! With that patch you just tried, you got

Detected 2312.045 MHz processor.

and the difference between _that_ and the assumed-correct-one is actually
just 85 ppm. Which should be perfectly fine.

[ With the "test against PM timer" patch, you had:

[ 0.000000] ref_freq: 2311825 pit_freq: 2310386

[ 0.000000] ref_freq: 2311803 pit_freq: 2310190

[ 0.000000] ref_freq: 2311824 pit_freq: 2310080

[ 0.000000] ref_freq: 2311831 pit_freq: 2310130

on four boots, so averaging them gives 2311.82 MHz, and the 2312.045MHz
you got with the improved fast-PIT code is still _way_ below 500ppm from
that - it's ~95 ppm away.

IOW, the new frequency really looks likely to work. ]
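
The ppm figures in this comparison can be rechecked with a small helper. This is a sketch only: `freq_diff_ppm` is a made-up name, frequencies are taken in kHz, and the result is truncated integer ppm.

```c
#include <stdint.h>

/*
 * Absolute difference between two frequencies, expressed in ppm of
 * `ref_khz` (truncated). Hypothetical helper, for illustration only.
 */
static uint64_t freq_diff_ppm(uint64_t khz, uint64_t ref_khz)
{
	uint64_t diff = khz > ref_khz ? khz - ref_khz : ref_khz - khz;

	return diff * 1000000ULL / ref_khz;
}
```

freq_diff_ppm(2310029, 2311847) gives 786, freq_diff_ppm(2312045, 2311847) gives 85, and freq_diff_ppm(2312045, 2311820) gives 97, in line with the ~790, 85 and ~95 ppm quoted above.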

Quite frankly, we don't know how exact the PM-timer is either - we just
know that the detection is "stable" (but so was the old PIT timer
detection: it was stably at 700ppm lower than the PM timer). So there is
nothing that says that 2311.82MHz is the "correct" frequency, but we
obviously know from your ntpd saga that it is much closer to correct than
the old 2310.029 was.

End result of all this: I'd really like you to try the modified PIT
frequency code for longer. Also, remember that getting one (or a couple)
"time reset" messages from ntpd while it's trying to sync up is not a
problem per se - it can validly take a while to synchronize. The problem
is literally only if it doesn't synchronize over time at all.

Linus

Jesper Krogh

Mar 15, 2009, 3:52:56 PM

to Linus Torvalds, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

Linus Torvalds wrote:
> End result of all this: I'd really like you to try the modified PIT
> frequency code for longer. Also, remember that getting one (or a couple)
> "time reset" messages from ntpd while it's trying to sync up is not a
> problem per se - it can validly take a while to synchronize. The problem
> is literally only if it doesn't synchronize over time at all.

Ok. I'll get it on and report back in 24 hours or so..

--
Jesper

Jesper Krogh

Mar 15, 2009, 3:53:52 PM

to Thomas Gleixner, Linus Torvalds, john stultz, Linux Kernel Mailing List, Len Brown, Ingo Molnar

Thomas Gleixner wrote:

> slow calibration path on every boot on your machine.
>
> (tscmax - tscmin) / avg = 0.064 (result from third run)
>
> On my test machines I get values below 0.02
>
> While it's statistically not really correct we still can use that info
> to catch cases like we see on your machines.
>
>> While booting up I saw this one on the serial console..
>> root@quad12:~# hwclock --systohc
>> Cannot access the Hardware Clock via any known method.
>> Use the --debug option to see the details of our search for an access method.
>> root@quad12:~# hwclock --systohc --debug
>> hwclock from util-linux-ng 2.13.1
>> hwclock: Open of /dev/rtc failed, errno=2: No such file or directory.
>> No usable clock interface found.
>> Cannot access the Hardware Clock via any known method.
>
> Can you provide your .config file please ?

http://krogh.cc/~jesper/config-2.6.29-rc8.txt

I tested the attached patch.. and after 1.5 hours it seems to work. I'll
remain on this one at least a day to see how it works. I'll keep it on
for now and report back in 24 hours or so.

It's still using tsc as clock-source.

Jesper

--
Jesper

Linus Torvalds

Mar 15, 2009, 4:36:16 PM

to Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

On Sun, 15 Mar 2009, Jesper Krogh wrote:
>
> I went on to trying Thomas Gleixner's patch (which seems to do exactly the
> same ..); I'll write a reply to that message in a few minutes.

Side note: no, Thomas' patch doesn't do exactly the same at all. It does
something similar, in that it looks at the time differences between calls
to the whole "wait for the PIT MSB to change" function, but those
differences _could_ in theory be very small, even if the error is very
big.

That's especially true if the PIT read ends up serializing with the PIT,
so that the "wait for MSB" essentially always takes exactly the same
amount of cycles (giving a zero error estimation in Thomas' version), but
the reads themselves can still be quite slow (giving a non-zero error term
in the end result).

IOW, Thomas' patch is good at finding variability in the reads - which
could be the result of SMM interaction, while my patch literally measures
how long it takes to read the MSB change.

Now in practice I suspect the variability in the MSB reads _probably_
correlates reasonably well with how long a single PIT read will take (ie
rather than finding variability due to SMM interaction, it will find
variability due to the "quantization" effect of the reads taking a
reasonably long time), so I suspect that in many cases Thomas' patch will
error out for the same cases mine does.

But the two patches are rather fundamentally different.

Linus

Jesper Krogh

Mar 16, 2009, 2:41:10 PM

to Thomas Gleixner, Linus Torvalds, john stultz, Linux Kernel Mailing List, Len Brown, Ingo Molnar

Jesper Krogh wrote:
> Thomas Gleixner wrote:
>
>> slow calibration path on every boot on your machine.
>>
>> (tscmax - tscmin) / avg = 0.064 (result from third run)
>>
>> On my test machines I get values below 0.02
>>
>> While it's statistically not really correct we still can use that info
>> to catch cases like we see on your machines.
>>
>>> While booting up I saw this one on the serial console..
>>> root@quad12:~# hwclock --systohc
>>> Cannot access the Hardware Clock via any known method.
>>> Use the --debug option to see the details of our search for an access
>>> method.
>>> root@quad12:~# hwclock --systohc --debug
>>> hwclock from util-linux-ng 2.13.1
>>> hwclock: Open of /dev/rtc failed, errno=2: No such file or directory.
>>> No usable clock interface found.
>>> Cannot access the Hardware Clock via any known method.
>>
>> Can you provide your .config file please ?
>
> http://krogh.cc/~jesper/config-2.6.29-rc8.txt
>
> I tested the attached patch.. and after 1.5 hours it seems to work. I'll
> remain on this one at least a day to see how it works. I'll keep it on
> for now and report back in 24 hours or so.
>
> It's still using tsc as clock-source.

No resets after 24 hours.. it works.

Jesper Krogh

Mar 16, 2009, 3:00:34 PM

to Linus Torvalds, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

Jesper Krogh wrote:
> Linus Torvalds wrote:
>> End result of all this: I'd really like you to try the modified PIT
>> frequency code for longer. Also, remember that getting one (or a
>> couple) "time reset" messages from ntpd while it's trying to sync up
>> is not a problem per se - it can validly take a while to synchronize.
>> The problem is literally only if it doesn't synchronize over time at all.
>
> Ok. I'll get it on and report back in 24 hours or so..

you were right. It works. No resets so far.

Linus Torvalds

Mar 16, 2009, 3:36:13 PM

to Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown, Ingo Molnar

On Mon, 16 Mar 2009, Jesper Krogh wrote:
>
> you were right. It works. No resets so far.

Goodie.

Here's a slightly cleaned-up patch that removes the debug messages, and
also re-organizes the code a bit so that it actually uses the "better than
500 ppm" as the way to decide when to stop calibrating.

Why?

I tested the 500 ppm check on some slower machines, and the old algorithm
of just waiting for 15ms actually failed that 500 ppm test. It was _very_
close - 16ms was enough - but it convinced me that the logic was too damn
fragile.
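
To illustrate why a fixed window is fragile, here is a sketch of the "iterate until under 500 ppm" stopping rule on its own. It is a simplification with made-up helper names: it assumes every MSB step costs the same number of TSC cycles and that the endpoint error d1+d2 stays constant, neither of which is guaranteed on real hardware.

```c
#include <stdint.h>

#define PIT_TICK_RATE 1193182ULL /* PIT input clock in Hz */

/*
 * Smallest number of MSB steps after which `err` (d1+d2) drops below
 * delta >> 11, assuming each step takes `step_cycles` TSC cycles.
 * Returns 0 if that never happens within `max_iter` steps.
 */
static int steps_until_500ppm(uint64_t step_cycles, uint64_t err, int max_iter)
{
	int i;

	for (i = 1; i <= max_iter; i++) {
		uint64_t delta = step_cycles * (uint64_t)i;

		if (err < (delta >> 11))
			return i;
	}
	return 0;
}

/* kHz = (delta * PIT_TICK_RATE) / (i * 256 * 1000) */
static uint64_t tsc_khz(uint64_t delta, int i)
{
	return delta * PIT_TICK_RATE / ((uint64_t)i * 256 * 1000);
}
```

Borrowing numbers from Jesper's logs (about 495650 cycles per step and an error term of 12442, which come from different boots, so this is only indicative), the rule would stop after 52 steps, roughly 11ms. A machine with a larger error term relative to its clock speed needs proportionally more steps, and may not get there in 15ms.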

I also think I know why John reported this:

> Ingo, Thomas: On the hardware I'm testing the fast-pit calibration only
> triggers probably 80-90% of the time. About 10-20% of the time, the
> initial check to pit_expect_msb(0xff) fails (count=0), so we may need to
> look more at this approach.

and the reason is that when we re-program the PIT, it will actually take
until the next timer edge (the incoming 1.19MHz timer) for the new values
to take effect. So before the first call to pit_expect_msb(), we should
make sure to delay for at least one PIT cycle. The simplest way to do that
is to read the PIT latch once, which will take about 2us.
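
For scale, a small sketch of the PIT timing involved. The helper names are hypothetical; the 1193182 Hz tick rate and the divide-by-256 MSB step are the ones used throughout this thread.

```c
#include <stdint.h>

#define PIT_TICK_RATE 1193182ULL /* PIT input clock in Hz */

/* Duration of `ticks` PIT input clock cycles, in nanoseconds (truncated). */
static uint64_t pit_ticks_to_ns(uint64_t ticks)
{
	return ticks * 1000000000ULL / PIT_TICK_RATE;
}

/* Number of MSB decrements that fit in `ms` milliseconds. */
static uint64_t pit_msb_steps_in_ms(uint64_t ms)
{
	return ms * PIT_TICK_RATE / 1000 / 256;
}
```

One PIT cycle is about 838ns, so a latch readback of around 2us covers it with margin. One MSB step is about 214.5us, which gives 69 steps in a 15ms window and 116 steps in a 25ms one.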

So this patch fixes that too.

John, does that make the PIT calibration work reliably on your machine?

The patch looks bigger than it is: most of the noise is just
re-indentation and some trivial re-organizing.

Linus

---
arch/x86/kernel/tsc.c | 110 +++++++++++++++++++++++++++++--------------------
1 files changed, 65 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 599e581..d5cebb5 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -273,30 +273,43 @@ static unsigned long pit_calibrate_tsc(u32 latch, unsigned long ms, int loopmin)

* use the TSC value at the transitions to calculate a pretty
* good value for the TSC frequencty.
*/
-static inline int pit_expect_msb(unsigned char val)

+static inline int pit_expect_msb(unsigned char val, u64 *tscp, unsigned long *deltap)

{
- int count = 0;
+ int count;
+ u64 tsc = 0;

for (count = 0; count < 50000; count++) {
/* Ignore LSB */
inb(0x42);
if (inb(0x42) != val)
break;
+ tsc = get_cycles();
}
- return count > 50;
+ *deltap = get_cycles() - tsc;
+ *tscp = tsc;
+
+ /*
+ * We require _some_ success, but the quality control
+ * will be based on the error terms on the TSC values.
+ */
+ return count > 5;
}

/*

- * How many MSB values do we want to see? We aim for a
- * 15ms calibration, which assuming a 2us counter read
- * error should give us roughly 150 ppm precision for
- * the calibration.
+ * How many MSB values do we want to see? We aim for
+ * a maximum error rate of 500ppm (in practice the
+ * real error is much smaller), but refuse to spend
+ * more than 25ms on it.
*/
-#define QUICK_PIT_MS 15
-#define QUICK_PIT_ITERATIONS (QUICK_PIT_MS * PIT_TICK_RATE / 1000 / 256)
+#define MAX_QUICK_PIT_MS 25
+#define MAX_QUICK_PIT_ITERATIONS (MAX_QUICK_PIT_MS * PIT_TICK_RATE / 1000 / 256)

static unsigned long quick_pit_calibrate(void)
{
+ int i;
+ u64 tsc, delta;

+ unsigned long d1, d2;
+

/* Set the Gate high, disable speaker */
outb((inb(0x61) & ~0x02) | 0x01, 0x61);

@@ -315,45 +328,52 @@ static unsigned long quick_pit_calibrate(void)

outb(0xff, 0x42);
outb(0xff, 0x42);

- if (pit_expect_msb(0xff)) {

- int i;

- u64 t1, t2, delta;
- unsigned char expect = 0xfe;

-
- t1 = get_cycles();
- for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
- if (!pit_expect_msb(expect))
- goto failed;
+ /*
+ * The PIT starts counting at the next edge, so we
+ * need to delay for a microsecond. The easiest way
+ * to do that is to just read back the 16-bit counter
+ * once from the PIT.
+ */
+ inb(0x42);
+ inb(0x42);
+
+ if (pit_expect_msb(0xff, &tsc, &d1)) {
+ for (i = 1; i <= MAX_QUICK_PIT_ITERATIONS; i++) {
+ if (!pit_expect_msb(0xff-i, &delta, &d2))
+ break;
+
+ /*
+ * Iterate until the error is less than 500 ppm
+ */
+ delta -= tsc;
+ if (d1+d2 < delta >> 11)
+ goto success;
}
- t2 = get_cycles();
-
- /*
- * Make sure we can rely on the second TSC timestamp:
- */
- if (!pit_expect_msb(expect))
- goto failed;
-
- /*
- * Ok, if we get here, then we've seen the
- * MSB of the PIT decrement QUICK_PIT_ITERATIONS
- * times, and each MSB had many hits, so we never
- * had any sudden jumps.
- *
- * As a result, we can depend on there not being
- * any odd delays anywhere, and the TSC reads are
- * reliable.
- *
- * kHz = ticks / time-in-seconds / 1000;
- * kHz = (t2 - t1) / (QPI * 256 / PIT_TICK_RATE) / 1000
- * kHz = ((t2 - t1) * PIT_TICK_RATE) / (QPI * 256 * 1000)
- */
- delta = (t2 - t1)*PIT_TICK_RATE;
- do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
- printk("Fast TSC calibration using PIT\n");
- return delta;
}
-failed:
+ printk("Fast TSC calibration failed\n");
return 0;
+
+success:
+ /*
+ * Ok, if we get here, then we've seen the
+ * MSB of the PIT decrement 'i' times, and the
+ * error has shrunk to less than 500 ppm.
+ *
+ * As a result, we can depend on there not being
+ * any odd delays anywhere, and the TSC reads are
+ * reliable (within the error). We also adjust the
+ * delta to the middle of the error bars, just
+ * because it looks nicer.
+ *
+ * kHz = ticks / time-in-seconds / 1000;
+ * kHz = (t2 - t1) / (I * 256 / PIT_TICK_RATE) / 1000
+ * kHz = ((t2 - t1) * PIT_TICK_RATE) / (I * 256 * 1000)
+ */
+ delta += (long)(d2 - d1)/2;
+ delta *= PIT_TICK_RATE;
+ do_div(delta, i*256*1000);
+ printk("Fast TSC calibration using PIT\n");
+ return delta;
}

/**
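
The arithmetic in the success path of the patch above can be tried out in userspace. The following is only a sketch under stated assumptions: `tsc_khz_from_pit` is a hypothetical helper name, plain 64-bit division stands in for the kernel's do_div(), and PIT_TICK_RATE is assumed to be 1193182 Hz.

```c
#include <stdint.h>

/* Assumption: the PIT input clock is 1193182 Hz (PIT_TICK_RATE). */
#define PIT_TICK_RATE 1193182ULL

/*
 * Userspace sketch of the final step of quick_pit_calibrate():
 *   tsc_delta - TSC cycles elapsed across 'i' MSB decrements
 *   d1, d2    - uncertainty (in cycles) of the first and last edge
 *   i         - number of MSB decrements seen (256 PIT ticks each)
 * Returns the estimated TSC frequency in kHz.
 */
static uint64_t tsc_khz_from_pit(uint64_t tsc_delta, unsigned long d1,
                                 unsigned long d2, int i)
{
    /* Move the estimate to the middle of the two error bars. */
    tsc_delta += (long)(d2 - d1) / 2;

    /* kHz = (cycles * PIT_TICK_RATE) / (i * 256 * 1000) */
    return tsc_delta * PIT_TICK_RATE / ((uint64_t)i * 256 * 1000);
}
```

With tsc_delta = 4291047 cycles over i = 10 MSB decrements and d1 = d2 = 0, the helper returns 2000000, which corresponds to a 2 GHz TSC.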

john stultz

Mar 16, 2009, 9:43:49 PM

to Linus Torvalds, Jesper Krogh, Thomas Gleixner, Linux Kernel Mailing List, Len Brown, Ingo Molnar

On Mon, 2009-03-16 at 12:32 -0700, Linus Torvalds wrote:
> I also think I know why John reported this:
>
> > Ingo, Thomas: On the hardware I'm testing the fast-pit calibration only
> > triggers probably 80-90% of the time. About 10-20% of the time, the
> > initial check to pit_expect_msb(0xff) fails (count=0), so we may need to
> > look more at this approach.
>
> and the reason is that when we re-program the PIT, it will actually take
> until the next timer edge (the incoming 1.1MHz timer) for the new values
> to take effect. So before the first call to pit_expect_msb(), we should
> make sure to delay for at least one PIT cycle. The simplest way to do that
> is to simply read the PIT latch once; it will take about 2us.
>
> So this patch fixes that too.
>
> John, does that make the PIT calibration work reliably on your machine?

Yep, I haven't seen a failure with it so far. And it has the same net
effect as my earlier patch (one extra read cycle), just without all the
conditionals, so it should be fine.

thanks
-john

Ingo Molnar

Mar 17, 2009, 4:15:08 AM

to Linus Torvalds, Peter Zijlstra, Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

Cool. Will you apply it yourself (in the merge window) or should
we pick it up?

Incidentally, yesterday i wrote a PIT auto-calibration routine
(see WIP patch below).

The core idea is to use _all_ thousands of measurement points
(not just two) to calculate the frequency ratio, with a built-in
noise detector which drops out of the loop if the observed noise
goes below ~10 ppm.

It is free-running: i.e. it observes noise and if the result
stabilizes quickly it can exit quickly. (with an upper bound for
unreliable PITs or virtualized systems, etc.)

It's WIP because it's not working yet (or at all?): i couldn't
get the statistical model right - it's too noisy at 1000-2000
ppm and the frequency result is off by 5000 ppm. Totally against
expectations. I traced it on a box with a good PIT and in the
trace the calculations look sane and the noise levels go down
nicely - except that the result sucks.

I also like yours more because it's simpler.

Ingo

Index: linux/arch/x86/kernel/tsc.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc.c
+++ linux/arch/x86/kernel/tsc.c
@@ -240,63 +240,201 @@ static unsigned long pit_calibrate_tsc(u
}

/*
- * This reads the current MSB of the PIT counter, and
- * checks if we are running on sufficiently fast and
- * non-virtualized hardware.
+ * Rolling statistical analysis of (PIT,TSC) measurement deltas.
*
- * Our expectations are:
- *
- * - the PIT is running at roughly 1.19MHz
- *
- * - each IO is going to take about 1us on real hardware,
- * but we allow it to be much faster (by a factor of 10) or
- * _slightly_ slower (ie we allow up to a 2us read+counter
- * update - anything else implies a unacceptably slow CPU
- * or PIT for the fast calibration to work.
- *
- * - with 256 PIT ticks to read the value, we have 214us to
- * see the same MSB (and overhead like doing a single TSC
- * read per MSB value etc).
- *
- * - We're doing 2 reads per loop (LSB, MSB), and we expect
- * them each to take about a microsecond on real hardware.
- * So we expect a count value of around 100. But we'll be
- * generous, and accept anything over 50.
- *
- * - if the PIT is stuck, and we see *many* more reads, we
- * return early (and the next caller of pit_expect_msb()
- * then consider it a failure when they don't see the
- * next expected value).
- *
- * These expectations mean that we know that we have seen the
- * transition from one expected value to another with a fairly
- * high accuracy, and we didn't miss any events. We can thus
- * use the TSC value at the transitions to calculate a pretty
- * good value for the TSC frequencty.
+ * We use a decaying average to estimate current noise levels.
+ * If noise falls below the expected threshold we exit the loop
+ * with the result.
+ *
+ * If this never happens - for example because the PIT is unreliable,
+ * then we break out after a limit and fail this type of calibration.
+ *
+ * Note that this method observes the statistical noise as-is without
+ * making any assumptions, so it is fundamentally robust against
+ * occasional PIT blips or SMI related system activities that can
+ * disturb calibration. An SMI in the wrong moment pushes up the
+ * noise level and causes the calibration loop to exit a tiny bit
+ * later - but still with a precise and reliable result.
*/
-static inline int pit_expect_msb(unsigned char val)
+static s64 sum_slope;
+static s64 sum_slope_noise;
+static s64 prev_slope;
+
+static int nr_measurements;
+
+#define MAX_MEASUREMENTS 10000
+
+#define MIN_MEASUREMENTS 100
+
+struct entry {
+ u64 tsc;
+ unsigned int pit;
+};
+
+/*
+ * A single measurement is as simple as possible:
+ */
+static inline void do_one_measurement(struct entry *entry)
{
- int count = 0;
+ unsigned char pit_lsb, pit_msb;
+ u64 tsc;

- for (count = 0; count < 50000; count++) {
- /* Ignore LSB */
- inb(0x42);
- if (inb(0x42) != val)
- break;
- }
- return count > 50;
+ /*
+ * We use the PIO accesses as natural TSC serialization barriers:
+ */
+ pit_lsb = inb(0x42);
+ tsc = get_cycles();
+ pit_msb = inb(0x42);
+
+ entry->tsc = tsc;
+ entry->pit = pit_msb*256 + pit_lsb;
+
+ trace_printk("tsc: %Ld, count: %d, nr: %d\n",
+ entry->tsc, entry->pit, nr_measurements);
}

/*
- * How many MSB values do we want to see? We aim for a
- * 15ms calibration, which assuming a 2us counter read
- * error should give us roughly 150 ppm precision for
- * the calibration.
+ * We scale numbers up by 1024 to reduce quantization effects:
*/
-#define QUICK_PIT_MS 15
-#define QUICK_PIT_ITERATIONS (QUICK_PIT_MS * PIT_TICK_RATE / 1000 / 256)

+static unsigned long do_delta_analysis(struct entry *e0, struct entry *e1)
+{
+ s64 slope, dslope;
+ s64 noise;
+ int decay;
+ int dc;
+ s64 dt;
+
+ dt = e1->tsc - e0->tsc; /* TSC is going up */
+ dc = e0->pit - e1->pit; /* PIT counter is going down */
+
+ /*
+ * Delta-PIT-count can be positive (or negative in case of
+ * an anomaly), but we made sure in do_measurement() that
+ * it can never be zero:
+ */
+ slope = 1024 * dt / dc;
+
+ dslope = slope - prev_slope;
+ noise = dslope;
+
+ trace_printk(" dt: %20Ld\n", dt);
+ trace_printk(" dc: %20d\n", dc);
+ trace_printk(" slope: %20Ld\n", slope);
+ trace_printk(" dslope: %20Ld\n", dslope);
+
+ /*
+ * Add a gentle decaying average to the slope and noise averages:
+ */
+ trace_printk(" prev sum_slope: %20Ld\n", sum_slope);

-static unsigned long quick_pit_calibrate(void)
+ /*
+ * Dynamic decay - starts with low values then later on
+ * the system cools down:
+ */
+ decay = 1;
+ if (sum_slope_noise)
+ decay = sum_slope / 64 / sum_slope_noise;
+ decay = min(2000, decay);
+ decay = max(nr_measurements/4, decay);
+
+ sum_slope = ((decay - 1)*sum_slope + slope)/decay;
+ trace_printk(" new sum_slope: %20Ld [decay: %d]\n",
+ sum_slope, decay);
+
+ trace_printk(" prev sum_slope_noise: %20Ld\n", sum_slope_noise);
+ sum_slope_noise = (1023*sum_slope_noise + noise)/1024;
+ trace_printk(" new sum_slope_noise: %20Ld\n", sum_slope_noise);
+
+ prev_slope = slope;
+
+ if (nr_measurements >= 64*MIN_MEASUREMENTS && sum_slope_noise < 10 ) {
+ trace_printk(" => low noise early exit!\n");
+ return 1;
+ }
+
+ return 0;
+}
+
+static int do_measurements(void)
+{
+ unsigned int pit_stuck;
+ unsigned long flags;
+ struct entry e0, e1;
+ int err = 0;
+
+ sum_slope_noise = 0;
+ sum_slope = 0;
+ prev_slope = 0;
+
+ nr_measurements = 0;
+
+ local_irq_save(flags);
+
+ trace_printk("PIT begin\n");
+ do_one_measurement(&e0);
+
+ do_one_measurement(&e0);
+
+ for (;;) {
+ pit_stuck = 0;
+repeat_e1:
+ do_one_measurement(&e1);
+ /*
+ * The typical case is that the PIT advanced a bit
+ * since we last read it (the PIOs take time, etc.).
+ * In case it did not advance (some really fast
+ * PIO implementation or virtualization) we will allow
+ * the count to stay 'stuck' up to 100 times:
+ *
+ * (Note that making sure that the count progresses also
+ * simplifies data processing later on.)
+ */
+ if (e0.pit != e1.pit) {
+ nr_measurements++;
+ if (nr_measurements >= MAX_MEASUREMENTS) {
+ printk("PIT: final count: %d\n", e1.pit);
+ break;
+ }
+ if (do_delta_analysis(&e0, &e1)) {
+ printk("PIT: low-noise count: %d\n", e1.pit);
+ break;
+ }
+ /*
+ * Reuse the second measurement point for the
+ * next delta measurement:
+ */
+ e0 = e1;
+ trace_printk("\n");
+ continue;
+ }
+ if (pit_stuck++ < 100)
+ goto repeat_e1;
+
+ printk(KERN_INFO "PIT auto-calibration: counter stuck at %d!\n",
+ e1.pit);
+ err = -EINVAL;
+ }
+
+ trace_printk("PIT end\n");
+ local_irq_restore(flags);
+
+ return err;
+}
+
+static unsigned long auto_pit_calibrate(void)
+{
+ if (do_measurements() < 0)
+ return 0;
+
+ printk("PIT: sum_slope: %Ld\n", sum_slope);
+ printk("PIT: Hz: %Ld\n", sum_slope * PIT_TICK_RATE);
+ printk("PIT: sum_slope_noise: %Ld\n", sum_slope_noise);
+ printk("PIT: nr_measurements: %d\n", nr_measurements);
+
+ return sum_slope * PIT_TICK_RATE / 1024 / 1000;
+}
+
+unsigned long quick_pit_calibrate(void)
{

/* Set the Gate high, disable speaker */
outb((inb(0x61) & ~0x02) | 0x01, 0x61);

@@ -316,45 +454,7 @@ static unsigned long quick_pit_calibrate

outb(0xff, 0x42);
outb(0xff, 0x42);

- if (pit_expect_msb(0xff)) {
- int i;
- u64 t1, t2, delta;
- unsigned char expect = 0xfe;
-
- t1 = get_cycles();
- for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
- if (!pit_expect_msb(expect))
- goto failed;

- }

-failed:
- return 0;
+ return auto_pit_calibrate();

Linus Torvalds

Mar 17, 2009, 12:00:42 PM

to Ingo Molnar, Peter Zijlstra, Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

On Tue, 17 Mar 2009, Ingo Molnar wrote:
>
> Cool. Will you apply it yourself (in the merge window) or should
> we pick it up?

I'll commit it. I already split it into two commits - one for the trivial
startup problem that John had, one for the "estimate error and exit when
smaller than 500ppm" part.

> Incidentally, yesterday i wrote a PIT auto-calibration routine
> (see WIP patch below).
>
> The core idea is to use _all_ thousands of measurement points
> (not just two) to calculate the frequency ratio, with a built-in
> noise detector which drops out of the loop if the observed noise
> goes below ~10 ppm.

I suspect that reaching 10 ppm is going to take too long in general.
Considering that I found a machine where reaching 500ppm took 16ms,
getting to 10ppm would take almost a second. That's a long time at bootup,
considering that people want the whole boot to take about that time ;)
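
The estimate above follows if one assumes that the absolute uncertainty of the end-point reads stays fixed, so that the relative error shrinks in proportion to the elapsed time. A minimal sketch of that scaling, with `time_needed_us` as a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Assumption: relative error is (fixed read uncertainty) / (elapsed time),
 * so the time needed scales inversely with the target error.
 *   known_time_us - time that was needed to reach known_ppm
 *   target_ppm    - the error bound we would like to reach instead
 */
static uint64_t time_needed_us(uint64_t known_time_us, uint64_t known_ppm,
                               uint64_t target_ppm)
{
    return known_time_us * known_ppm / target_ppm;
}
```

Plugging in the numbers from the message, time_needed_us(16000, 500, 10) gives 800000 microseconds, i.e. 0.8 seconds.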

I also do think it's a bit unnecessarily complicated. We really only care
about the end points - obviously we can end up being unlucky and get a
very noisy end-point due to something like SMI or virtualization, but if
that happens, we're really just better off failing quickly instead, and
we'll go on to the slower calibration routines.

On real hardware without SMI or virtualization overhead, the delays
_should_ be very stable. On my main machine, for example, the PIT read
really seems very stable at about 2.5us (which matches the expectation
that one 'inb' should take roughly one microsecond pretty closely). So
that should be the default case, and the case that the fast calibration is
designed for.

For the other cases, we really can just exit and do something else.

> It's WIP because it's not working yet (or at all?): i couldn't
> get the statistical model right - it's too noisy at 1000-2000
> ppm and the frequency result is off by 5000 ppm.

I suspect your measurement overhead is getting noticeable. You do all
those divides, but even more so, you do all those traces. Also, it looks
like you do purely local pairwise analysis at subsequent PIT modelling
points, which can't work - you need to average over a long time to
stabilize it.

So you _can_ do something like what you do, but you'd need to find a
low-noise start and end point, and do analysis over that longer range
instead of trying to do it over individual cases.

> I also like yours more because it's simpler.

In fact, it's much simpler than what we used to do. No real assumptions
about how quickly we can read the PIT, no need for magic values ("we can
distinguish a slow virtual environment from real hardware by the fact that
we can do at least 50 PIT reads in one cycle"), no nothing. Just a simple
"is it below 500ppm yet?".

(Well, technically, it compares to 1 in 2048 rather than 500 in a million,
since that is much cheaper, so it's really looking for "better than
488ppm")
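
The shift-based comparison can be written out on its own to see where the 488ppm figure comes from. This is a userspace sketch only; `error_small_enough` and `error_ppm` are hypothetical names, not functions from the patch.

```c
#include <stdint.h>

/*
 * Nonzero once the combined end-point uncertainty (d1 + d2) is below
 * 1/2048 of the measured interval. A shift by 11 replaces a division.
 */
static int error_small_enough(uint64_t delta, unsigned long d1,
                              unsigned long d2)
{
    return (uint64_t)d1 + d2 < (delta >> 11);
}

/* The same ratio expressed in parts per million, rounded down. */
static unsigned int error_ppm(uint64_t delta, unsigned long d1,
                              unsigned long d2)
{
    return (unsigned int)(((uint64_t)d1 + d2) * 1000000ULL / delta);
}
```

For delta = 2048000 cycles the threshold is 1000 cycles: d1 + d2 = 999 passes, 1000 does not, and error_ppm() reports 488 at that boundary (1000000 / 2048 rounded down).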

Linus

Ingo Molnar

Mar 17, 2009, 12:14:01 PM

to Linus Torvalds, Peter Zijlstra, Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> On Tue, 17 Mar 2009, Ingo Molnar wrote:
> >
> > Cool. Will you apply it yourself (in the merge window) or should
> > we pick it up?
>
> I'll commit it. I already split it into two commits - one for the
> trivial startup problem that John had, one for the "estimate error
> and exit when smaller than 500ppm" part.

ok.

> > Incidentally, yesterday i wrote a PIT auto-calibration routine
> > (see WIP patch below).
> >
> > The core idea is to use _all_ thousands of measurement points
> > (not just two) to calculate the frequency ratio, with a built-in
> > noise detector which drops out of the loop if the observed noise
> > goes below ~10 ppm.
>
> I suspect that reaching 10 ppm is going to take too long in
> general. Considering that I found a machine where reaching 500ppm
> took 16ms, getting to 10ppm would take almost a second. That's a
> long time at bootup, considering that people want the whole boot
> to take about that time ;)
>
> I also do think it's a bit unnecessarily complicated. We really
> only care about the end points - obviously we can end up being
> unlucky and get a very noisy end-point due to something like SMI
> or virtualization, but if that happens, we're really just better
> off failing quickly instead, and we'll go on to the slower
> calibration routines.

That's the idea of my patch: to use not two endpoints but thousands
of measurement points. That way we don't have to worry about the
precision of the endpoints - any 'bad' measurement will be
counter-acted by thousands of 'good' measurements.

That's the theory at least - practice got in my way ;-)

By measuring more we can get a more precise result, and we also do
not assume anything about how much time passes between two
measurement points. A single measurement is:

+ /*
+ * We use the PIO accesses as natural TSC serialization barriers:
+ */
+ pit_lsb = inb(0x42);
+ tsc = get_cycles();
+ pit_msb = inb(0x42);

Just like we can prove that there's an exoplanet around a star, just
by doing a _ton_ of measurements of a very noisy data source. As
long as there's an underlying physical value to be measured (and we
are not measuring pure noise) that value is recoverable, with enough
measurements.

> On real hardware without SMI or virtualization overhead, the
> delays _should_ be very stable. On my main machine, for example,
> the PIT read really seems very stable at about 2.5us (which
> matches the expectation that one 'inb' should take roughly one
> microsecond pretty closely). So that should be the default case,
> and the case that the fast calibration is designed for.
>
> For the other cases, we really can just exit and do something
> else.
>
> > It's WIP because it's not working yet (or at all?): i couldn't
> > get the statistical model right - it's too noisy at 1000-2000
> > ppm and the frequency result is off by 5000 ppm.
>
> I suspect your measurement overhead is getting noticeable. You do
> all those divides, but even more so, you do all those traces.
> Also, it looks like you do purely local pairwise analysis at
> subsequent PIT modelling points, which can't work - you need to
> average over a long time to stabilize it.

Actually, it's key to my trick that what happens _between_ the
measurement points does not matter _at all_.

My 'delta' algorithm does not assume anything about how much time
passes between two measurement points - it calculates the slope and
keeps a rolling average of that slope.

That's why i could put the delta analysis there. We are capturing
thousands of measurement points, and what matters is the precision
of the 'pair' of (PIT,TSC) timestamp measurements.

I got roughly the same end result noise and the same anomalies with
tracing enabled and disabled. (and the number of data points was cut
in half with tracing enabled)

Ingo

Linus Torvalds

Mar 17, 2009, 12:33:31 PM

to Ingo Molnar, Peter Zijlstra, Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

On Tue, 17 Mar 2009, Ingo Molnar wrote:
>
> That's the idea of my patch: to use not two endpoints but thousands
> of measurement points.

Umm. Except you don't.

> By measuring more we can get a more precise result, and we also do
> not assume anything about how much time passes between two
> measurement points.

That's fine, but your actual code doesn't _do_ that.

> My 'delta' algorithm does not assume anything about how much time
> passes between two measurement points - it calculates the slope and
> keeps a rolling average of that slope.

No, you keep a very bad measure of "some kind of random average of the
last few points", which - if I read things right:

- lacks precision (you really need to use 'double' floating point to do
it well, otherwise the rounding errors will kill you). You seem to be
aiming for a 10-bit fixed point thing, which may or may not work if
done cleverly, but:

- seems to be based on a rather weak averaging function which certainly
will lose data over time.
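
The rounding concern in the first point can be illustrated with the 1023/1024 update that the WIP patch applies to sum_slope_noise. The sketch below runs that update in userspace, once in 64-bit integers and once in double. The helper names are hypothetical, and a constant input is assumed purely for illustration.

```c
#include <stdint.h>

/* Integer decaying average: avg = (1023*avg + sample) / 1024. */
static int64_t decay_int(int64_t avg, int64_t sample)
{
    return (1023 * avg + sample) / 1024;
}

/* The same update carried out in double precision. */
static double decay_double(double avg, double sample)
{
    return (1023.0 * avg + sample) / 1024.0;
}

/* Feed the same sample 'steps' times, starting from zero. */
static int64_t run_int(int64_t sample, int steps)
{
    int64_t avg = 0;

    while (steps-- > 0)
        avg = decay_int(avg, sample);
    return avg;
}

static double run_double(double sample, int steps)
{
    double avg = 0.0;

    while (steps-- > 0)
        avg = decay_double(avg, sample);
    return avg;
}
```

Starting from zero, run_int(500, 10000) stays at 0 because every update is truncated away by the integer division, while run_double(500.0, 10000) converges toward 500. With a larger input such as 2048 the integer version does move, but it stalls at 1025 instead of reaching 2048.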

The thing is, the only _accurate_ average is the one done over long time
distances. It's very true that your slope thing works very well over such
long times, and you'd get accurate measurement if you did it that way, BUT
THAT IS NOT WHAT YOU DO. You have a very tight loop, so you get very bad
slopes, and then you use a weak averaging function to try to make them
better, but it never does.

Also, there seems to be a fundamental bug in your PIT reading routine. My
fast-TSC calibration only looks at the MSB of the PIT read for a very good
reason: if you don't use the explicit LATCH command, you may be getting
the MSB of one counter value, and then the LSB of another. So your PIT
read can easily be off by ~256 PIT cycles. Only by caring only for the MSB
can you do an unlatched read!

That is why pit_expect_msb() looks for the "edge" where the MSB changes,
and never actually looks at the LSB.
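
The torn read described above can be modelled without any hardware. In this sketch `unlatched_read` is a hypothetical helper: it combines the low byte of the counter value that was current when the LSB was read with the high byte of the (later, smaller) value that was current when the MSB was read.

```c
#include <stdint.h>

/*
 * Model of an unlatched two-byte PIT read on a down-counter:
 *   at_lsb - counter value at the moment the LSB was read
 *   at_msb - counter value at the moment the MSB was read
 * Returns what the reader would assemble from the two bytes.
 */
static uint16_t unlatched_read(uint16_t at_lsb, uint16_t at_msb)
{
    uint8_t lsb = (uint8_t)(at_lsb & 0xff);
    uint8_t msb = (uint8_t)(at_msb >> 8);

    return (uint16_t)((msb << 8) | lsb);
}
```

If the counter goes from 0x1201 to 0x11fe between the two byte reads, the assembled value is 0x1101, which is 256 below the value at the time of the LSB read. When the MSB does not change between the reads (for example 0x1280 to 0x127d), the result is simply 0x1280.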

This issue may be an additional reason for your problems, although maybe
your noise correction will be able to avoid those cases.

Linus

Ingo Molnar

Mar 17, 2009, 12:41:32 PM

to Linus Torvalds, Peter Zijlstra, Jesper Krogh, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

* Linus Torvalds <torv...@linux-foundation.org> wrote:

Hm, the intention there was to have a memory of ~1000 entries via a
decaying average of 1:1000.

In parallel to that there's also a noise estimator (which too decays
over time). So basically when observed noise is very low we
essentially use the data from the last ~1000 measurements. (well,
not exactly - as the 'memory' of more recent data will be stronger
than that of older ones.)

Again ... it's a clearly non-working patch so it's not really a
defendable concept :-)

> Also, there seems to be a fundamental bug in your PIT reading
> routine. My fast-TSC calibration only looks at the MSB of the PIT
> read for a very good reason: if you don't use the explicit LATCH
> command, you may be getting the MSB of one counter value, and then
> the LSB of another. So your PIT read can easily be off by ~256 PIT
> cycles. Only by caring only for the MSB can you do an unlatched
> read!
>
> That is why pit_expect_msb() looks for the "edge" where the MSB
> changes, and never actually looks at the LSB.
>
> This issue may be an additional reason for your problems, although
> maybe your noise correction will be able to avoid those cases.

indeed. I did check the trace results though via gnuplot yesterday
(suspecting PIT readout outliers) and there were no outliers.

For any final patch it's still a showstopper issue.

But the source of error and miscalibration is elsewhere.

Ingo

Olivier Galibert

Mar 17, 2009, 1:28:29 PM

to Linux Kernel Mailing List

On Tue, Mar 17, 2009 at 05:13:22PM +0100, Ingo Molnar wrote:
> That's why i could put the delta analysis there. We are capturing
> thousands of measurement points, and what matters is the precision
> of the 'pair' of (PIT,TSC) timestamp measurements.
>
> I got roughly the same end result noise and the same anomalies with
> tracing enabled and disabled. (and the number of data points was cut
> in half with tracing enabled)

Any reason for not doing a bog-standard linear regression?
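
For reference, an ordinary least-squares fit over (PIT, TSC) samples might look like the sketch below. It is a userspace illustration, not something from the thread: `struct sample` and `ls_slope` are hypothetical names, the pit field is assumed to already hold PIT ticks elapsed since the first sample, and double arithmetic is used, which kernel code generally avoids.

```c
#include <stddef.h>
#include <stdint.h>

struct sample {
    uint64_t tsc;   /* TSC value at the measurement */
    uint32_t pit;   /* PIT ticks elapsed since the first sample */
};

/*
 * Least-squares slope of TSC against PIT ticks, i.e. TSC cycles per
 * PIT tick. Returns 0.0 when no slope can be computed.
 */
static double ls_slope(const struct sample *s, size_t n)
{
    double mean_x = 0.0, mean_y = 0.0, sxx = 0.0, sxy = 0.0;
    size_t i;

    if (n < 2)
        return 0.0;

    /* Means, with TSC taken relative to the first sample. */
    for (i = 0; i < n; i++) {
        mean_x += (double)s[i].pit;
        mean_y += (double)(s[i].tsc - s[0].tsc);
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    /* slope = sum(dx*dy) / sum(dx*dx) */
    for (i = 0; i < n; i++) {
        double dx = (double)s[i].pit - mean_x;
        double dy = (double)(s[i].tsc - s[0].tsc) - mean_y;

        sxx += dx * dx;
        sxy += dx * dy;
    }
    return sxx > 0.0 ? sxy / sxx : 0.0;
}
```

Multiplying the returned slope by the PIT input frequency gives the TSC frequency in Hz. Unlike a rolling average of pairwise slopes, every sample contributes to a single fit over the whole interval.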

OG.

Ingo Molnar

Mar 21, 2009, 6:07:30 AM

to Jesper Krogh, Linus Torvalds, john stultz, Thomas Gleixner, Linux Kernel Mailing List, Len Brown

* Jesper Krogh <jes...@krogh.cc> wrote:

> Linus Torvalds wrote:
>>
>> On Mon, 16 Mar 2009, Jesper Krogh wrote:
>>> you were right. It works. No resets so far.
>>
>> Goodie.
>>
>> Here's a slightly cleaned-up patch that removes the debug messages, and
>> also re-organizes the code a bit so that it actually uses the "better
>> than 500 ppm" as the way to decide when to stop calibrating.
>

> Can we ship:
> commit a6a80e1d8cf82b46a69f88e659da02749231eb36
> Author: Linus Torvalds <torv...@linux-foundation.org>
> Date: Tue Mar 17 07:58:26 2009 -0700
>
> Fix potential fast PIT TSC calibration startup glitch
>
> and
> commit 9e8912e04e612b43897b4b722205408b92f423e5
> Author: Linus Torvalds <torv...@linux-foundation.org>
> Date: Tue Mar 17 08:13:17 2009 -0700
>
> Fast TSC calibration: calculate proper frequency error bounds
>
>
> to the 2.6.28-stable series? The first one is needed to apply the second.

Yes, would be nice to have these fixes in .28.9.

Ingo