Re: kernel bug in kvm

Avi Kivity

Nov 1, 2009, 5:30:01 AM

On 11/01/2009 12:00 PM, Tejun Heo wrote:
> Hello,
>
> Avi Kivity wrote:
>
>> Only, that merge doesn't change virt/kvm or arch/x86/kvm.
>>
>> Tejun, anything known bad about that merge? ada3fa15 kills kvm.
>>
> Nothing rings a bell at the moment. How does it kill kvm? One big
> difference caused by that merge is use of sparse areas near the top of
> vmalloc area. This caused vmalloc area shortage on sparc64 and
> exposed paging code bug on ppc64 which caused the cpu to fault
> repeatedly on the same address. Maybe something similar is happening
> with kvm?
>
>

We get a page fault immediately (next instruction) after returning from
the guest when running with oprofile. The page fault address does not
match anything the instruction does, so presumably it is one of the
accesses the processor performs in order to service an NMI (ordinary
interrupts are masked; and the fact that it happens with oprofile
strengthens this assumption).

If this is correct, the fault is not in the NMI handler itself, but in
one of the memory areas the cpu looks in to vector the NMI, which can be:

- the IDT
- the GDT
- the TSS
- the NMI stack

Except for the IDT, these are per-cpu structures, though I don't know
whether they are allocated with the percpu infrastructure.
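
The elimination argument above amounts to a range check: take the faulting address and see which candidate region, if any, contains it. The following is only a sketch under stated assumptions. The region names come from the list above, but `Region`, `classify_fault`, and every address and size in `EXAMPLE_REGIONS` are hypothetical placeholders, not values from the oops.

```python
from typing import NamedTuple, Optional, Sequence


class Region(NamedTuple):
    name: str
    start: int  # address of the first byte of the region
    size: int   # length of the region in bytes


def classify_fault(addr: int, regions: Sequence[Region]) -> Optional[str]:
    """Return the name of the first region containing addr, or None."""
    for region in regions:
        if region.start <= addr < region.start + region.size:
            return region.name
    return None


# Hypothetical layout for a single CPU; all addresses are made up.
EXAMPLE_REGIONS = [
    Region("IDT", 0xFFFF880000010000, 256 * 16),  # 256 gates, 16 bytes each on x86-64
    Region("GDT", 0xFFFF880000020000, 4096),
    Region("TSS", 0xFFFF880000030000, 4096),
    Region("NMI stack", 0xFFFF880000040000, 4096),
]
```

With real region addresses for the affected cpu, a `None` result for the faulting address would argue against this theory, while a hit would name the structure to look at first.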

Here is the code in question:

> 3ae7: 75 05 jne 3aee<vmx_vcpu_run+0x26a>
> 3ae9: 0f 01 c2 vmlaunch
> 3aec: eb 03 jmp 3af1<vmx_vcpu_run+0x26d>
> 3aee: 0f 01 c3 vmresume
> 3af1: 48 87 0c 24 xchg %rcx,(%rsp)

^^^ fault, but not at (%rsp)

> 3af5: 48 89 81 18 01 00 00 mov %rax,0x118(%rcx)
> 3afc: 48 89 99 30 01 00 00 mov %rbx,0x130(%rcx)
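
As a cross-check on the listing above, the two short jumps can be decoded by hand: each is an opcode byte followed by a signed 8-bit displacement, relative to the address of the following instruction. The helper name `short_jump_target` below is made up for illustration; the addresses and bytes are copied from the listing.

```python
def short_jump_target(insn_addr: int, insn_bytes: bytes) -> int:
    """Target of a 2-byte x86 short jump (opcode byte + signed rel8)."""
    if len(insn_bytes) != 2:
        raise ValueError("expected an opcode byte plus a rel8 byte")
    rel8 = int.from_bytes(insn_bytes[1:], "little", signed=True)
    # The displacement is relative to the address of the next instruction.
    return insn_addr + len(insn_bytes) + rel8


print(hex(short_jump_target(0x3AE7, bytes.fromhex("7505"))))  # 0x3aee (vmresume)
print(hex(short_jump_target(0x3AEC, bytes.fromhex("eb03"))))  # 0x3af1 (xchg)
```

Both the vmlaunch and vmresume paths therefore continue at 3af1, the xchg that is reported as faulting.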

--

error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Tejun Heo

Nov 1, 2009, 5:50:02 AM

Hello,

Avi Kivity wrote:
> We get a page fault immediately (next instruction) after returning from
> the guest when running with oprofile. The page fault address does not
> match anything the instruction does, so presumably it is one of the
> accesses the processor performs in order to service an NMI (ordinary
> interrupts are masked; and the fact that it happens with oprofile
> strengthens this assumption).

Ah... okay, that's tricky but IIRC faults like that can be
distinguished from regular ones via processor state, right?

> If this is correct, the fault is not in the NMI handler itself, but in
> one of the memory areas the cpu looks in to vector the NMI, which can be:
>
> - the IDT
> - the GDT
> - the TSS
> - the NMI stack
>
> Except for the IDT, these are per-cpu structures, though I don't know
> whether they are allocated with the percpu infrastructure.

I don't know where the NMI stack is, but all the others are percpu.

> Here is the code in question:
>
>> 3ae7: 75 05 jne 3aee<vmx_vcpu_run+0x26a>
>> 3ae9: 0f 01 c2 vmlaunch
>> 3aec: eb 03 jmp 3af1<vmx_vcpu_run+0x26d>
>> 3aee: 0f 01 c3 vmresume
>> 3af1: 48 87 0c 24 xchg %rcx,(%rsp)
>
> ^^^ fault, but not at (%rsp)

Can you please post the full oops (including kernel debug messages
during boot) or give me a pointer to the original message? Also, does
the faulting address coincide with any symbol?

Thanks.

--
tejun

Avi Kivity

Nov 1, 2009, 6:40:02 AM

On 11/01/2009 12:45 PM, Tejun Heo wrote:
> Hello,
>
> Avi Kivity wrote:
>
>> We get a page fault immediately (next instruction) after returning from
>> the guest when running with oprofile. The page fault address does not
>> match anything the instruction does, so presumably it is one of the
>> accesses the processor performs in order to service an NMI (ordinary
>> interrupts are masked; and the fact that it happens with oprofile
>> strengthens this assumption).
>>
> Ah... okay, that's tricky but IIRC faults like that can be
> distinguished from regular ones via processor state, right?
>

Not on x86. But given that the fault address is different from %rsp
(which is what the instruction accesses) and %rip, there aren't many
alternatives.

>> Here is the code in question:
>>
>>
>>> 3ae7: 75 05 jne 3aee<vmx_vcpu_run+0x26a>
>>> 3ae9: 0f 01 c2 vmlaunch
>>> 3aec: eb 03 jmp 3af1<vmx_vcpu_run+0x26d>
>>> 3aee: 0f 01 c3 vmresume
>>> 3af1: 48 87 0c 24 xchg %rcx,(%rsp)
>>>
>> ^^^ fault, but not at (%rsp)
>>
> Can you please post the full oops (including kernel debug messages
> during boot) or give me a pointer to the original message?

http://www.mail-archive.com/k...@vger.kernel.org/msg23458.html

> Also, does
> the faulting address coincide with any symbol?
>

No (at least, not in System.map).
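
For reference, a check like this can be scripted. System.map lines have the form `address type name`, so the closest symbol at or below an address is a sorted search. This is a minimal sketch: `load_symbols`, `nearest_symbol`, and the sample entries are hypothetical, not taken from the kernel in question.

```python
import bisect


def load_symbols(lines):
    """Parse 'address type name' lines into a sorted list of (addr, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def nearest_symbol(symbols, addr):
    """Return (name, offset) for the closest symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, addr - base


# Made-up sample entries in System.map format.
SAMPLE = """\
ffffffff81000000 T _text
ffffffff8100b000 T example_func_a
ffffffff8100b080 t example_func_b
"""

symbols = load_symbols(SAMPLE.splitlines())
name, offset = nearest_symbol(symbols, 0xFFFFFFFF8100B090)
print(f"{name}+{offset:#x}")  # example_func_b+0x10
```

System.map carries no symbol sizes, so a large offset from the nearest symbol most likely means the address is not inside it. An address below every listed symbol simply returns `None`.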

--
error compiling committee.c: too many arguments to function

Tejun Heo

Nov 18, 2009, 4:30:02 AM

Hello,

11/01/2009 08:31 PM, Avi Kivity wrote:
>>> Here is the code in question:
>>>
>>>
>>>> 3ae7: 75 05 jne 3aee<vmx_vcpu_run+0x26a>
>>>> 3ae9: 0f 01 c2 vmlaunch
>>>> 3aec: eb 03 jmp 3af1<vmx_vcpu_run+0x26d>
>>>> 3aee: 0f 01 c3 vmresume
>>>> 3af1: 48 87 0c 24 xchg %rcx,(%rsp)
>>>>
>>> ^^^ fault, but not at (%rsp)
>>>
>> Can you please post the full oops (including kernel debug messages
>> during boot) or give me a pointer to the original message?
>
> http://www.mail-archive.com/k...@vger.kernel.org/msg23458.html
>
>> Also, does
>> the faulting address coincide with any symbol?
>>
>
> No (at least, not in System.map).

Has there been any progress? Is kvm + oprofile still broken?

Thanks.

--
tejun

Andrew Theurer

Nov 25, 2009, 8:40:02 PM

Tejun Heo wrote:
> Hello,
>
> 11/01/2009 08:31 PM, Avi Kivity wrote:
>>>> Here is the code in question:
>>>>
>>>>
>>>>> 3ae7: 75 05 jne 3aee<vmx_vcpu_run+0x26a>
>>>>> 3ae9: 0f 01 c2 vmlaunch
>>>>> 3aec: eb 03 jmp 3af1<vmx_vcpu_run+0x26d>
>>>>> 3aee: 0f 01 c3 vmresume
>>>>> 3af1: 48 87 0c 24 xchg %rcx,(%rsp)
>>>>>
>>>> ^^^ fault, but not at (%rsp)
>>>>
>>> Can you please post the full oops (including kernel debug messages
>>> during boot) or give me a pointer to the original message?
>> http://www.mail-archive.com/k...@vger.kernel.org/msg23458.html
>>
>>> Also, does
>>> the faulting address coincide with any symbol?
>>>
>> No (at least, not in System.map).
>
> Has there been any progress? Is kvm + oprofile still broken?
>

I just tried testing tip of kvm.git, but unfortunately I think I might
be hitting a different problem, where processes run 100% in kernel mode.
In my case, cpus 9 and 13 were stuck, running qemu processes. Stack
backtraces for both cpus are below. FWIW, kernel.org 2.6.32-rc7 does not
have this problem, or the original problem.

> NMI backtrace for cpu 9
> CPU 9:
> Modules linked in: tun sunrpc af_packet bridge stp ipv6 binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod kvm_intel kvm uinput sr_mod cdrom ata_generic pata_acpi ata_piix joydev libata ide_pci_generic usbhid ide_core hid serio_raw cdc_ether usbnet mii matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc iTCO_wdt i2c_i801 i2c_core pcspkr iTCO_vendor_support ioatdma thermal rtc_cmos rtc_core bnx2 rtc_lib dca thermal_sys hwmon sg button shpchp pci_hotplug qla2xxx scsi_transport_fc scsi_tgt sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: processor]
> Pid: 5687, comm: qemu-system-x86 Not tainted 2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1 -[7947AC1]-
> RIP: 0010:[<ffffffff810b802b>] [<ffffffff810b802b>] fire_user_return_notifiers+0x31/0x36
> RSP: 0018:ffff88095024df08 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000000000800 RCX: ffff88095024c000
> RDX: ffff880028340000 RSI: 0000000000000000 RDI: ffff88095024df58
> RBP: ffff88095024df18 R08: 0000000000000000 R09: 0000000000000001
> R10: 000000caf1fff62d R11: ffff8805b584de40 R12: 00007fffae48e0f0
> R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> FS: 00007f45c69d57c0(0000) GS:ffff880028340000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: fffff9800121056e CR3: 0000000953d36000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Call Trace:
> <#DB[1]> <<EOE>> Pid: 5687, comm: qemu-system-x86 Not tainted 2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1
> Call Trace:
> <NMI> [<ffffffff8100af53>] ? show_regs+0x44/0x49
> [<ffffffff812e57b2>] nmi_watchdog_tick+0xc2/0x1b9
> [<ffffffff812e4e73>] do_nmi+0xb0/0x252
> [<ffffffff812e48a0>] nmi+0x20/0x30
> [<ffffffff810b802b>] ? fire_user_return_notifiers+0x31/0x36
> <<EOE>> [<ffffffff8100b844>] do_notify_resume+0x62/0x69
> [<ffffffff8100bf48>] ? int_check_syscall_exit_work+0x9/0x3d
> [<ffffffff8100bf8e>] int_signal+0x12/0x17

> NMI backtrace for cpu 13
> CPU 13:
> Modules linked in: tun sunrpc af_packet bridge stp ipv6 binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod kvm_intel kvm uinput sr_mod cdrom ata_generic pata_acpi ata_piix joydev libata ide_pci_generic usbhid ide_core hid serio_raw cdc_ether usbnet mii matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc iTCO_wdt i2c_i801 i2c_core pcspkr iTCO_vendor_support ioatdma thermal rtc_cmos rtc_core bnx2 rtc_lib dca thermal_sys hwmon sg button shpchp pci_hotplug qla2xxx scsi_transport_fc scsi_tgt sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: processor]
> Pid: 5792, comm: qemu-system-x86 Not tainted 2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1 -[7947AC1]-
> RIP: 0010:[<ffffffff8100bfb0>] [<ffffffff8100bfb0>] int_restore_rest+0x1d/0x3d
> RSP: 0018:ffff88124f491f58 EFLAGS: 00000292
> RAX: 0000000000000800 RBX: 00007fff9df852e0 RCX: ffff88124f490000
> RDX: ffff88099ff40000 RSI: 0000000000000000 RDI: 000000000000fe2e
> RBP: 00007fff9df85260 R08: ffff88124f490000 R09: 0000000000000000
> R10: 0000000000000005 R11: ffff880954971da0 R12: 00007fff9df851e0
> R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> FS: 00007f73b5b1d7c0(0000) GS:ffff88099ff40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007f8d5a8de9d0 CR3: 0000000eb34d7000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Call Trace:
> <#DB[1]> <<EOE>> Pid: 5792, comm: qemu-system-x86 Not tainted 2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1
> Call Trace:
> <NMI> [<ffffffff8100af53>] ? show_regs+0x44/0x49
> [<ffffffff812e57b2>] nmi_watchdog_tick+0xc2/0x1b9
> [<ffffffff812e4e73>] do_nmi+0xb0/0x252
> [<ffffffff812e48a0>] nmi+0x20/0x30
> [<ffffffff8100bfb0>] ? int_restore_rest+0x1d/0x3d
> <<EOE>>

-Andrew

Tejun Heo

Nov 25, 2009, 8:50:01 PM

Hello,

11/26/2009 10:35 AM, Andrew Theurer wrote:
> I just tried testing tip of kvm.git, but unfortunately I think I might
> be hitting a different problem, where processes run 100% in kernel mode.
> In my case, cpus 9 and 13 were stuck, running qemu processes. Stack
> backtraces for both cpus are below. FWIW, kernel.org 2.6.32-rc7 does not
> have this problem, or the original problem.

2.6.32-rc7 doesn't have a problem with kvm + oprofile? If the original
analysis was right, I can't think of anything which could have changed
that between the merge commit and 2.6.32-rc7.

Thanks.

--
tejun

Avi Kivity

Nov 26, 2009, 5:40:02 AM

That's a bug with the new user return notifiers. Is your host kernel
preemptible?

I think I saw this once but I'm not sure. I can't reproduce with a host
kernel build, some silly guest workload, and 'perf top' to generate an
nmi load.

--
error compiling committee.c: too many arguments to function

Andrew Theurer

Nov 26, 2009, 8:50:02 AM

preempt is off.

>
> I think I saw this once but I'm not sure. I can't reproduce with a host
> kernel build, some silly guest workload, and 'perf top' to generate an
> nmi load.
>

-Andrew

Avi Kivity

Nov 29, 2009, 9:50:02 AM

On 11/26/2009 03:35 AM, Andrew Theurer wrote:

> I just tried testing tip of kvm.git, but unfortunately I think I might
> be hitting a different problem, where processes run 100% in kernel
> mode. In my case, cpus 9 and 13 were stuck, running qemu processes.
> Stack backtraces for both cpus are below. FWIW, kernel.org
> 2.6.32-rc7 does not have this problem, or the original problem.

I just posted a patch fixing this, titled "[PATCH tip:x86/entry] core:
fix user return notifier on fork()".

--
error compiling committee.c: too many arguments to function

Andrew Theurer

Nov 30, 2009, 11:30:02 AM

On Sun, 2009-11-29 at 16:46 +0200, Avi Kivity wrote:
> On 11/26/2009 03:35 AM, Andrew Theurer wrote:
> > I just tried testing tip of kvm.git, but unfortunately I think I might
> > be hitting a different problem, where processes run 100% in kernel
> > mode. In my case, cpus 9 and 13 were stuck, running qemu processes.
> > Stack backtraces for both cpus are below. FWIW, kernel.org
> > 2.6.32-rc7 does not have this problem, or the original problem.
>
> I just posted a patch fixing this, titled "[PATCH tip:x86/entry] core:
> fix user return notifier on fork()".
>

Thank you, Avi. I am running on this patch and am not seeing this
problem anymore. I'll be testing for the previous issue next.

-Andrew