Performance overhead of paravirt_ops on native identified

Jeremy Fitzhardinge

May 13, 2009, 8:20:08 PM

Hi Ingo,

Xiaohui Xin and some other folks at Intel have been looking into what's
behind the performance hit of paravirt_ops when running native.

It appears that the hit is entirely due to the paravirtualized
spinlocks; the extra call/return in the spinlock path is somehow
causing an increase in the cycles/instruction of somewhere around 2-7%
(seems to vary quite a lot from test to test). The working theory is
that the CPU's pipeline is getting upset about the
call->call->locked-op->return->return, and seems to be failing to
speculate (though I haven't seen anything definitive about the precise
reasons). This doesn't entirely make sense, because the performance
hit is also visible on unlock and other operations which don't involve
locked instructions. But spinlock operations clearly swamp all the
other pvops operations, even though I can't imagine that they're
nearly as common (there's only a .05% increase in instructions
executed).

If I disable just the pv-spinlock calls, my tests show that pvops is
identical to non-pvops performance on native (my measurements show that
it is actually about .1% faster, but Xiaohui shows a .05% slowdown).

Summary of results, averaging 10 runs of the "mmperf" test, using a
no-pvops build as baseline:

                   nopv      Pv-nospin   Pv-spin
CPU cycles         100.00%    99.89%     102.18%
instructions       100.00%   100.10%     100.15%
CPI                100.00%    99.79%     102.03%
cache ref          100.00%   100.84%     100.28%
cache miss         100.00%    90.47%      88.56%
cache miss rate    100.00%    89.72%      88.31%
branches           100.00%    99.93%     100.04%
branch miss        100.00%   103.66%     107.72%
branch miss rate   100.00%   103.73%     107.67%
wallclock          100.00%    99.90%     102.20%

The clear effect here is that the 2% increase in CPI is
directly reflected in the final wallclock time.

(The other interesting effect is that the more ops are
out of line calls via pvops, the lower the cache access
and miss rates. Not too surprising, but it suggests that
the non-pvops kernel is over-inlined. On the flipside,
the branch misses go up correspondingly...)
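
The CPI row can be cross-checked against the cycle and instruction rows, since CPI is simply cycles divided by instructions and every figure is relative to the nopv baseline. A minimal sketch of that arithmetic in C; the helper name relative_cpi is made up for illustration and is not part of any kernel or perf tooling:

```c
#include <assert.h>

/*
 * Relative CPI, in percent of baseline, computed from relative cycle
 * and instruction counts (both also given in percent of baseline).
 * Hypothetical helper, for illustration only.
 */
static double relative_cpi(double cycles_pct, double instr_pct)
{
	assert(instr_pct > 0.0);
	return 100.0 * cycles_pct / instr_pct;
}
```

With the table's numbers, relative_cpi(102.18, 100.15) comes out at roughly 102.03 and relative_cpi(99.89, 100.10) at roughly 99.79, which agrees with the CPI row.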

So, what's the fix?

Paravirt patching turns all the pvops calls into direct calls, so
_spin_lock etc do end up having direct calls. For example, the compiler
generated code for paravirtualized _spin_lock is:

<_spin_lock+0>: mov %gs:0xb4c8,%rax
<_spin_lock+9>: incl 0xffffffffffffe044(%rax)
<_spin_lock+15>: callq *0xffffffff805a5b30
<_spin_lock+22>: retq

The indirect call will get patched to:
<_spin_lock+0>: mov %gs:0xb4c8,%rax
<_spin_lock+9>: incl 0xffffffffffffe044(%rax)
<_spin_lock+15>: callq <__ticket_spin_lock>
<_spin_lock+20>: nop; nop /* or whatever 2-byte nop */
<_spin_lock+22>: retq
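
The rewrite shown above can be sketched in C roughly as follows. This is not the kernel's paravirt patching code, only an illustration of the encoding: a 5-byte "call rel32" (opcode 0xe8) whose displacement is relative to the end of the call, followed by one-byte nops to fill the rest of the 7-byte indirect-call site. The function name patch_direct_call and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CALL_REL32_LEN 5	/* 0xe8 opcode + 4-byte displacement */

/*
 * Overwrite a call site of `len` bytes with a direct `call rel32` to
 * `target`, padding what is left with one-byte nops (0x90).
 * `site_addr` is the runtime address of insn[0].
 * Returns 0 on success, -1 if the site is too small or the target is
 * out of rel32 range.
 */
static int patch_direct_call(uint8_t *insn, size_t len,
			     uint64_t site_addr, uint64_t target)
{
	int64_t delta;
	uint32_t rel;
	int i;

	assert(insn != NULL);
	if (len < CALL_REL32_LEN)
		return -1;

	/* The displacement is relative to the next instruction. */
	delta = (int64_t)(target - (site_addr + CALL_REL32_LEN));
	if (delta < INT32_MIN || delta > INT32_MAX)
		return -1;
	rel = (uint32_t)(int32_t)delta;

	insn[0] = 0xe8;
	for (i = 0; i < 4; i++)		/* little-endian displacement */
		insn[1 + i] = (uint8_t)(rel >> (8 * i));
	memset(insn + CALL_REL32_LEN, 0x90, len - CALL_REL32_LEN);
	return 0;
}
```

Note that this only shows the byte layout; it says nothing about how to perform the write safely while other CPUs may be executing the site.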

One possibility is to inline _spin_lock, etc, when building an
optimised kernel (ie, when there's no spinlock/preempt
instrumentation/debugging enabled). That will remove the outer
call/return pair, returning the instruction stream to a single
call/return, which will presumably execute the same as the non-pvops
case. The downsides are: 1) it will replicate the
preempt_disable/enable code at each lock/unlock callsite; this code is
fairly small, but not nothing; and 2) the spinlock definitions are
already a very heavily tangled mass of #ifdefs and other preprocessor
magic, and making any changes will be non-trivial.

The other obvious answer is to disable pv-spinlocks. Making them a
separate config option is fairly easy, and it would be trivial to
enable them only when Xen is enabled (as the only non-default user).
But it doesn't really address the common case of a distro build which
is going to have Xen support enabled, and leaves the open question of
whether the native performance cost of pv-spinlocks is worth the
performance improvement on a loaded Xen system (10% saving of overall
system CPU when guests block rather than spin). Still it is a
reasonable short-term workaround.

The best solution would be to work out whether this really is a problem
interaction with Intel's pipelines, and come up with something that
avoids it. It would be very interesting to see if there's a similar hit
on AMD systems.

J

From 839033e472c8f3b228be35e57a8b31fbb7f9cf98 Mon Sep 17 00:00:00 2001
From: Jeremy Fitzhardinge <jeremy.fi...@citrix.com>
Date: Wed, 13 May 2009 11:58:17 -0700
Subject: [PATCH] x86: add config to disable PV spinlocks

Paravirtualized spinlocks seem to cause a 2-7% performance hit when
running a pvops kernel native. Without them, the pvops kernel is
identical to a non-pvops kernel in performance.

[ Impact: reduce overhead of pvops when running native ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fi...@citrix.com>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5f50179..a99ed71 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -498,6 +498,17 @@ config PARAVIRT
over full virtualization. However, when run without a hypervisor
the kernel is theoretically slower and slightly larger.

+config PARAVIRT_SPINLOCKS
+ bool "Enable paravirtualized spinlocks"
+ depends on PARAVIRT && SMP
+ default XEN
+ ---help---
+ Paravirtualized spinlocks allow a pvops backend to replace the
+ spinlock implementation with something virtualization-friendly
+ (for example, block the virtual CPU rather than spinning).
+ Unfortunately the downside is an as-yet unexplained performance
+ when running native.
+
config PARAVIRT_CLOCK
bool
default n
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 1fe5837..4fb37c8 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -1443,7 +1443,7 @@ u64 _paravirt_ident_64(u64);

#define paravirt_nop ((void *)_paravirt_nop)

-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)

static inline int __raw_spin_is_locked(struct raw_spinlock *lock)
{
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index e5e6caf..b7e5db8 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -172,7 +172,7 @@ static inline int __ticket_spin_is_contended(raw_spinlock_t *lock)
return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
}

-#ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PARAVIRT_SPINLOCKS

static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
{
@@ -206,7 +206,7 @@ static __always_inline void __raw_spin_lock_flags(raw_spinlock_t *lock,
__raw_spin_lock(lock);
}

-#endif
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */

static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
{
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 68a4ff6..4f78bd6 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -90,7 +90,8 @@ obj-$(CONFIG_DEBUG_NX_TEST) += test_nx.o
obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o
obj-$(CONFIG_KVM_GUEST) += kvm.o
obj-$(CONFIG_KVM_CLOCK) += kvmclock.o
-obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o paravirt-spinlocks.o
+obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o

obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index aa34423..70ec9b9 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -134,7 +134,9 @@ static void *get_call_destination(u8 type)
.pv_irq_ops = pv_irq_ops,
.pv_apic_ops = pv_apic_ops,
.pv_mmu_ops = pv_mmu_ops,
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
.pv_lock_ops = pv_lock_ops,
+#endif
};
return *((void **)&tmpl + type);
}
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 3b767d0..172438f 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -9,5 +9,6 @@ obj-y := enlighten.o setup.o multicalls.o mmu.o irq.o \
time.o xen-asm.o xen-asm_$(BITS).o \
grant-table.o suspend.o

-obj-$(CONFIG_SMP) += smp.o spinlock.o
-obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o
\ No newline at end of file
+obj-$(CONFIG_SMP) += smp.o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= spinlock.o
+obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 5c50a10..22494fd 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -61,15 +61,26 @@ void xen_setup_vcpu_info_placement(void);
#ifdef CONFIG_SMP
void xen_smp_init(void);

-void __init xen_init_spinlocks(void);
-__cpuinit void xen_init_lock_cpu(int cpu);
-void xen_uninit_lock_cpu(int cpu);
-
extern cpumask_var_t xen_cpu_initialized_map;
#else
static inline void xen_smp_init(void) {}
#endif

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init xen_init_spinlocks(void);
+__cpuinit void xen_init_lock_cpu(int cpu);
+void xen_uninit_lock_cpu(int cpu);
+#else
+static inline void xen_init_spinlocks(void)
+{
+}
+static inline void xen_init_lock_cpu(int cpu)
+{
+}
+static inline void xen_uninit_lock_cpu(int cpu)
+{
+}
+#endif

/* Declare an asm function, along with symbols needed to make it
inlineable */


H. Peter Anvin

May 13, 2009, 9:20:08 PM

The other obvious option, it would seem to me, would be to eliminate the
*inner* call/return pair, i.e. merging the _spin_lock setup code in with
the internals of each available implementation (in the case above,
__ticket_spin_lock). This is effectively what happens on native. The
one problem with that is that every callsite now becomes a patching target.

That brings me to a somewhat half-arsed thought I have been walking
around with for a while.

Consider a paravirt -- or for that matter any other call which is
runtime-static; this isn't just limited to paravirt -- function which
looks to the C compiler just like any other external function -- no
indirection. We can point it by default to a function which is really
just an indirect jump to the appropriate handler, that handles the
prepatching case. However, a linktime pass over vmlinux.o can find all
the points where this function is called, and turn it into a list of
patch sites(*). The advantages are:

1. [minor] no additional nop padding due to indirect function calls.
2. [major] no need for a ton of wrapper macros manifest in the code.

paravirt_ops that turn into pure inline code in the native case is
obviously another ball of wax entirely; there inline assembly wrappers
are simply unavoidable.

-hpa

(*) if patching code on SMP was cheaper, we could actually do this
lazily, and wouldn't have to store a list of patch sites. I don't feel
brave enough to go down that route.

Jan Beulich

May 14, 2009, 4:10:10 AM

>>> Jeremy Fitzhardinge <jer...@goop.org> 14.05.09 02:16 >>>

>One possibility is to inline _spin_lock, etc, when building an
>optimised kernel (ie, when there's no spinlock/preempt
>instrumentation/debugging enabled). That will remove the outer
>call/return pair, returning the instruction stream to a single
>call/return, which will presumably execute the same as the non-pvops
>case. The downsides are: 1) it will replicate the
>preempt_disable/enable code at each lock/unlock callsite; this code is
>fairly small, but not nothing; and 2) the spinlock definitions are
>already a very heavily tangled mass of #ifdefs and other preprocessor
>magic, and making any changes will be non-trivial.
>
>The other obvious answer is to disable pv-spinlocks. Making them a
>separate config option is fairly easy, and it would be trivial to
>enable them only when Xen is enabled (as the only non-default user).
>But it doesn't really address the common case of a distro build which
>is going to have Xen support enabled, and leaves the open question of
>whether the native performance cost of pv-spinlocks is worth the
>performance improvement on a loaded Xen system (10% saving of overall
>system CPU when guests block rather than spin). Still it is a
>reasonable short-term workaround.

Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate
the current indirection, and replace it by an indirection for just the contention
case? As I view it, the problem for Xen aren't really the ticket locks by
themselves, but rather the extra spinning involved, which is of concern only
if a lock is contended. We're using ticket locks quite happily in our kernels,
with directed instead of global wakeup from the unlock path. The only open
issue we currently have is that while for native keeping interrupts disabled
while spinning may be acceptable (though I'm not sure how -rt folks are
viewing this), in a pv environment one should really re-enable interrupts
here due to the potentially much higher latency.
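
To make the idea concrete, here is a small user-space sketch in C11 of a ticket lock that takes an indirect call only once a waiter has spun for a while, plus a directed kick on unlock when somebody is queued. It is not the code from any kernel tree: the struct contention_ops hooks, the counting backend and SPIN_THRESHOLD are all assumptions made up for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SPIN_THRESHOLD (1u << 10)	/* arbitrary spin bound for the sketch */

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

/* Hypothetical hooks standing in for a hypervisor backend. */
struct contention_ops {
	void (*wait)(struct ticket_lock *lock, unsigned int ticket);
	void (*kick)(struct ticket_lock *lock, unsigned int ticket);
};

static void ticket_lock_init(struct ticket_lock *lock)
{
	atomic_init(&lock->next, 0);
	atomic_init(&lock->owner, 0);
}

static void ticket_lock_acquire(struct ticket_lock *lock,
				const struct contention_ops *ops)
{
	unsigned int me = atomic_fetch_add(&lock->next, 1);

	assert(ops != NULL && ops->wait != NULL);
	for (;;) {
		unsigned int spins = SPIN_THRESHOLD;

		/* Fast path: plain spinning, no indirection at all. */
		while (spins--) {
			if (atomic_load(&lock->owner) == me)
				return;
		}
		/* Contended for too long: only now call the backend. */
		ops->wait(lock, me);
	}
}

static void ticket_lock_release(struct ticket_lock *lock,
				const struct contention_ops *ops)
{
	unsigned int next_owner = atomic_fetch_add(&lock->owner, 1) + 1;

	assert(ops != NULL && ops->kick != NULL);
	/* Somebody holds a later ticket: wake that specific waiter. */
	if (atomic_load(&lock->next) != next_owner)
		ops->kick(lock, next_owner);
}

/* Trivial backend that only counts calls, to show when hooks fire. */
static unsigned int nr_waits, nr_kicks, last_kicked;

static void counting_wait(struct ticket_lock *lock, unsigned int ticket)
{
	(void)lock;
	(void)ticket;
	nr_waits++;
}

static void counting_kick(struct ticket_lock *lock, unsigned int ticket)
{
	(void)lock;
	last_kicked = ticket;
	nr_kicks++;
}

static const struct contention_ops counting_ops = {
	.wait = counting_wait,
	.kick = counting_kick,
};
```

In the uncontended case neither hook is called, so the lock and unlock paths contain no indirect call; a real backend would block the vcpu in wait() and send a wakeup to the vcpu holding the kicked ticket.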

Jan

Peter Zijlstra

May 14, 2009, 4:30:19 AM

This sounds remarkably like what the dynamic function call tracer does.

Peter Zijlstra

May 14, 2009, 4:40:10 AM

On Thu, 2009-05-14 at 09:05 +0100, Jan Beulich wrote:
> Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate
> the current indirection, and replace it by an indirection for just the contention
> case? As I view it, the problem for Xen aren't really the ticket locks by
> themselves, but rather the extra spinning involved, which is of concern only
> if a lock is contended. We're using ticket locks quite happily in our kernels,
> with directed instead of global wakeup from the unlock path. The only open
> issue we currently have is that while for native keeping interrupts disabled
> while spinning may be acceptable (though I'm not sure how -rt folks are
> viewing this), in a pv environment one should really re-enable interrupts
> here due to the potentially much higher latency.

the -rt folks don't nearly have as many spinlocks, and for those we do
like ticket locks, because they are much fairer and give better worst
case contention behaviour.

Also, for the -rt folks, preempt disable is about as bad as irq disable.

H. Peter Anvin

May 14, 2009, 10:20:07 AM

Peter Zijlstra wrote:
>
> This sounds remarkably like what the dynamic function call tracer does.
>

I'm sure this has been invented before... probably more than once. Far
too many things we invent don't get generalized.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Jeremy Fitzhardinge

May 14, 2009, 1:40:12 PM

H. Peter Anvin wrote:
> The other obvious option, it would seem to me, would be to eliminate the
> *inner* call/return pair, i.e. merging the _spin_lock setup code in with
> the internals of each available implementation (in the case above,
> __ticket_spin_lock). This is effectively what happens on native. The
> one problem with that is that every callsite now becomes a patching target.
>

Yes, that's an option. It has the downside of requiring changes to the
common spinlock code in kernel/spinlock.c and linux/spinlock_api*.h.
The amount of duplicated code is potentially quite large, but there
aren't that many spinlock implementations.

Also, there's not much point in using pv spinlocks when all the
instrumentation is on. Lock contention metering, for example, never
does a proper lock operation, but does a spin with repeated trylocks; we
can't optimise that, so there's no point in trying.
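
As a rough illustration of the trylock-spinning pattern (not the kernel's lock instrumentation code), the following C11 sketch acquires a lock purely by retrying trylock and counts the failed attempts. There is never a single blocking "lock" operation that a pv backend could replace. The names tas_lock, tas_trylock and lock_with_metering are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct tas_lock {
	atomic_flag held;
};

static bool tas_trylock(struct tas_lock *l)
{
	assert(l != NULL);
	/* test-and-set returns the old value: false means we took it */
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void tas_unlock(struct tas_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Acquire by retrying trylock, returning how many attempts failed.
 * The failure count is the kind of figure contention metering wants.
 */
static unsigned long lock_with_metering(struct tas_lock *l)
{
	unsigned long failed = 0;

	while (!tas_trylock(l))
		failed++;
	return failed;
}
```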

So maybe if we can fast-path the fast-path to pv spinlocks, the problem
is more tractable...

> That brings me to a somewhat half-arsed thought I have been walking
> around with for a while.
>
> Consider a paravirt -- or for that matter any other call which is
> runtime-static; this isn't just limited to paravirt -- function which
> looks to the C compiler just like any other external function -- no
> indirection. We can point it by default to a function which is really
> just an indirect jump to the appropriate handler, that handles the
> prepatching case. However, a linktime pass over vmlinux.o can find all
> the points where this function is called, and turn it into a list of
> patch sites(*). The advantages are:
>
> 1. [minor] no additional nop padding due to indirect function calls.
> 2. [major] no need for a ton of wrapper macros manifest in the code.
>
> paravirt_ops that turn into pure inline code in the native case is
> obviously another ball of wax entirely; there inline assembly wrappers
> are simply unavoidable.
>

We did consider something like this at the outset. As I remember, there
were a few concerns:

  * There was no relocation data available in the kernel. I played
    around with ways to make it work, but they ended up being fairly
    complex and brittle, with a tendency (of course) to trigger
    binutils bugs. Maybe that has changed.
  * We didn't really want to implement two separate mechanisms for the
    same thing. Given that we wanted to inline things like
    cli/sti/pushf/popf, we needed to have something capable of full
    patching. Having a separate mechanism for patching calls is
    harder to justify. Now that pvops is well settled, perhaps it
    makes sense to consider adding another more general patching
    mechanism to avoid the indirect calls (a dynamic linker, essentially).

I won't make any great claims about the beauty of the PV_CALL* gunk, but
at the very least it is contained within paravirt.h.

> (*) if patching code on SMP was cheaper, we could actually do this
> lazily, and wouldn't have to store a list of patch sites. I don't feel
> brave enough to go down that route.
>

The problem that the tracepoints people were trying to solve was harder,
where they wanted to replace an arbitrary set of instructions with some
other arbitrary instructions (or a call) - that would need some kind of SMP
synchronization, both for general sanity and to keep the Intel rules happy.

In theory relinking a call should just be a single word write into the
instruction, but I don't know if that gets into undefined territory or
not. On older P4 systems it would end up blowing away the trace cache
on all cpus when you write to code like that, so you'd want to be sure
that your references are getting resolved fairly quickly. But it's hard
to see how patching the offset in a call instruction would end up
calling something other than the old or new function.

J

Jeremy Fitzhardinge

May 14, 2009, 1:50:10 PM

Jan Beulich wrote:
> Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate
> the current indirection, and replace it by an indirection for just the contention
> case? As I view it, the problem for Xen aren't really the ticket locks by
> themselves, but rather the extra spinning involved, which is of concern only
> if a lock is contended. We're using ticket locks quite happily in our kernels,
> with directed instead of global wakeup from the unlock path.

Do you have a patch to illustrate what you mean? How do you keep track
of the target vcpu for the directed wakeup? Are you using the
event-poll mechanism to block?

J

H. Peter Anvin

May 14, 2009, 2:00:18 PM

Jeremy Fitzhardinge wrote:
>
> We did consider something like this at the outset. As I remember, there
> were a few concerns:
>
> * There was no relocation data available in the kernel. I played
> around with ways to make it work, but they ended up being fairly
> complex and brittle, with a tendency (of course) to trigger
> binutils bugs. Maybe that has changed.

We already do this pass (in fact, we do something like three passes of
it.) It's basically the vmlinux.o pass.

> * We didn't really want to implement two separate mechanisms for the
> same thing. Given that we wanted to inline things like
> cli/sti/pushf/popf, we needed to have something capable of full
> patching. Having a separate mechanism for patching calls is
> harder to justify. Now that pvops is well settled, perhaps it
> makes sense to consider adding another more general patching
> mechanism to avoid the indirect calls (a dynamic linker, essentially).

Full patching is understandable (although I think sometimes the code
generated was worse than out-of-line... I believe you have fixed that.)

> I won't make any great claims about the beauty of the PV_CALL* gunk, but
> at the very least it is contained within paravirt.h.

There is still massive spillover into other code, though, at least some
of which could possibly be avoided. I don't know.

>> (*) if patching code on SMP was cheaper, we could actually do this
>> lazily, and wouldn't have to store a list of patch sites. I don't feel
>> brave enough to go down that route.
>>
> The problem that the tracepoints people were trying to solve was harder,
> where they wanted to replace an arbitrary set of instructions with some
> other arbitrary instructions (or a call) - that would need some kind of SMP
> synchronization, both for general sanity and to keep the Intel rules happy.
>
> In theory relinking a call should just be a single word write into the
> instruction, but I don't know if that gets into undefined territory or
> not. On older P4 systems it would end up blowing away the trace cache
> on all cpus when you write to code like that, so you'd want to be sure
> that your references are getting resolved fairly quickly. But it's hard
> to see how patching the offset in a call instruction would end up
> calling something other than the old or new function.

The problem is that since the call offset field can be arbitrarily
aligned -- it could even cross page boundaries -- you still have
absolutely no SMP atomicity guarantees. So you still have all the same
problems. Without

-hpa
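
The alignment concern can be checked mechanically. The sketch below, in C, tests whether the 4-byte displacement of a 5-byte "call rel32" at a given address straddles a power-of-two boundary such as a cache line or a page. The 64-byte and 4096-byte sizes in the usage note are assumptions, and the helper name rel32_straddles is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if the rel32 field of a `call rel32` located at `call_addr`
 * crosses a boundary of size `boundary` (which must be a power of
 * two). The field occupies the 4 bytes after the 0xe8 opcode.
 */
static bool rel32_straddles(uint64_t call_addr, uint64_t boundary)
{
	uint64_t first = call_addr + 1;
	uint64_t last = first + 3;

	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
	return (first & ~(boundary - 1)) != (last & ~(boundary - 1));
}
```

For example, a call at 0x103d has its displacement at 0x103e..0x1041 and so crosses a 64-byte line, and a call at 0xffd has its displacement spanning two 4096-byte pages, so a single aligned store cannot update either.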

tip-bot for Jeremy Fitzhardinge

May 15, 2009, 2:20:09 PM

Commit-ID: b4ecc126991b30fe5f9a59dfacda046aeac124b2
Gitweb: http://git.kernel.org/tip/b4ecc126991b30fe5f9a59dfacda046aeac124b2
Author: Jeremy Fitzhardinge <jer...@goop.org>
AuthorDate: Wed, 13 May 2009 17:16:55 -0700
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Fri, 15 May 2009 20:07:42 +0200

x86: Fix performance regression caused by paravirt_ops on native kernels

Xiaohui Xin and some other folks at Intel have been looking into what's
behind the performance hit of paravirt_ops when running native.

It appears that the hit is entirely due to the paravirtualized
spinlocks introduced by:

| commit 8efcbab674de2bee45a2e4cdf97de16b8e609ac8
| Date: Mon Jul 7 12:07:51 2008 -0700
|
| paravirt: introduce a "lock-byte" spinlock implementation

The extra call/return in the spinlock path is somehow

[ Impact: fix pvops performance regression when running native ]

Analysed-by: "Xin Xiaohui" <xiaoh...@intel.com>
Analysed-by: "Li Xin" <xin...@intel.com>
Analysed-by: "Nakajima Jun" <jun.na...@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fi...@citrix.com>
Acked-by: H. Peter Anvin <h...@zytor.com>
Cc: Nick Piggin <npi...@suse.de>
Cc: Xen-devel <xen-...@lists.xensource.com>
LKML-Reference: <4A0B62F7...@goop.org>
[ fixed the help text ]
Signed-off-by: Ingo Molnar <mi...@elte.hu>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index df9e885..a6efe0a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -498,6 +498,19 @@ config PARAVIRT
over full virtualization. However, when run without a hypervisor
the kernel is theoretically slower and slightly larger.

+config PARAVIRT_SPINLOCKS
+ bool "Paravirtualization layer for spinlocks"
+ depends on PARAVIRT && SMP && EXPERIMENTAL
+ ---help---
+ Paravirtualized spinlocks allow a pvops backend to replace the
+ spinlock implementation with something virtualization-friendly
+ (for example, block the virtual CPU rather than spinning).
+
+ Unfortunately the downside is an up to 5% performance hit on
+ native kernels, with various workloads.
+
+ If you are unsure how to answer this question, answer N.
+
config PARAVIRT_CLOCK
bool
default n
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 378e369..a53da00 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -1443,7 +1443,7 @@ u64 _paravirt_ident_64(u64);

#define paravirt_nop ((void *)_paravirt_nop)

-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)

static inline int __raw_spin_is_locked(struct raw_spinlock *lock)
{
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index e5e6caf..b7e5db8 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -172,7 +172,7 @@ static inline int __ticket_spin_is_contended(raw_spinlock_t *lock)
return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
}

-#ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PARAVIRT_SPINLOCKS

static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
{
@@ -206,7 +206,7 @@ static __always_inline void __raw_spin_lock_flags(raw_spinlock_t *lock,
__raw_spin_lock(lock);
}

-#endif
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */

static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
{
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 145cce7..88d1bfc 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -89,7 +89,8 @@ obj-$(CONFIG_DEBUG_NX_TEST) += test_nx.o
obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o
obj-$(CONFIG_KVM_GUEST) += kvm.o
obj-$(CONFIG_KVM_CLOCK) += kvmclock.o
-obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o paravirt-spinlocks.o
+obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o

obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 8e45f44..9faf43b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -134,7 +134,9 @@ static void *get_call_destination(u8 type)
.pv_irq_ops = pv_irq_ops,
.pv_apic_ops = pv_apic_ops,
.pv_mmu_ops = pv_mmu_ops,
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
.pv_lock_ops = pv_lock_ops,
+#endif
};
return *((void **)&tmpl + type);
}
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 3b767d0..172438f 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -9,5 +9,6 @@ obj-y := enlighten.o setup.o multicalls.o mmu.o irq.o \
time.o xen-asm.o xen-asm_$(BITS).o \
grant-table.o suspend.o

-obj-$(CONFIG_SMP) += smp.o spinlock.o
-obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o
\ No newline at end of file
+obj-$(CONFIG_SMP) += smp.o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= spinlock.o
+obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 2013946..ca6596b 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -62,15 +62,26 @@ void xen_setup_vcpu_info_placement(void);

Jeremy Fitzhardinge

May 15, 2009, 3:00:20 PM

Jan Beulich wrote:
> A patch for the pv-ops kernel would require some time. What I can give you
> right away - just for reference - are the sources we currently use in our kernel:
> attached.

Hm, I see. Putting a call out to a pv-ops function in the ticket lock
slow path looks pretty straightforward. The need for an extra lock on
the contended unlock side is a bit unfortunate; have you measured to see
what hit that has? Seems to me like you could avoid the problem by
using per-cpu storage rather than stack storage (though you'd need to
copy the per-cpu data to stack when handling a nested spinlock).

What's the thinking behind the xen_spin_adjust() stuff?

> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
> {
> 	unsigned int token, count;
> 	bool free;
>
> 	__ticket_spin_lock_preamble;
> 	if (unlikely(!free))
> 		token = xen_spin_adjust(lock, token);
> 	do {
> 		count = 1 << 10;
> 		__ticket_spin_lock_body;
> 	} while (unlikely(!count) && !xen_spin_wait(lock, token));
> }

How does this work? Doesn't it always go into the slowpath loop even if
the preamble got the lock with no contention?

Jan Beulich

May 18, 2009, 3:20:16 AM

>>> Jeremy Fitzhardinge <jer...@goop.org> 15.05.09 20:50 >>>

>Jan Beulich wrote:
>> A patch for the pv-ops kernel would require some time. What I can give you
>> right away - just for reference - are the sources we currently use in our kernel:
>> attached.
>
>Hm, I see. Putting a call out to a pv-ops function in the ticket lock
>slow path looks pretty straightforward. The need for an extra lock on
>the contended unlock side is a bit unfortunate; have you measured to see
>what hit that has? Seems to me like you could avoid the problem by
>using per-cpu storage rather than stack storage (though you'd need to
>copy the per-cpu data to stack when handling a nested spinlock).

Not sure how you'd imagine this to work: The unlock code has to look at all
cpus' data in either case, so an inner lock would still be required imo.

>What's the thinking behind the xen_spin_adjust() stuff?

That's the placeholder for implementing interrupt re-enabling in the irq-save
lock path. The requirement is that if a nested lock attempt hits the same
lock on the same cpu that it failed to get acquired on earlier (but got a ticket
already), tickets for the given (lock, cpu) pair need to be circularly shifted
around so that the innermost requestor gets the earliest ticket. This is what
that function's job will become if I ever get to implement this.

>> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
>> {
>> 	unsigned int token, count;
>> 	bool free;
>>
>> 	__ticket_spin_lock_preamble;
>> 	if (unlikely(!free))
>> 		token = xen_spin_adjust(lock, token);
>> 	do {
>> 		count = 1 << 10;
>> 		__ticket_spin_lock_body;
>> 	} while (unlikely(!count) && !xen_spin_wait(lock, token));
>> }
>
>How does this work? Doesn't it always go into the slowpath loop even if
>the preamble got the lock with no contention?

It indeed always enters the slowpath loop, but only for a single pass through
part of its body (the first compare in the body macro will make it exit the loop
right away: 'token' is not only the ticket here, but the full lock->slock
contents). But yes, I think you're right, one could avoid entering the body
altogether by moving the containing loop into the if(!free) body. The logic
went through a number of re-writes, so I must have overlooked that
opportunity on the last round of adjustments.

Jan

Jeremy Fitzhardinge

May 20, 2009, 6:50:11 PM

Jan Beulich wrote:
>>>> Jeremy Fitzhardinge <jer...@goop.org> 15.05.09 20:50 >>>
>>>>
>> Jan Beulich wrote:
>>
>>> A patch for the pv-ops kernel would require some time. What I can give you
>>> right away - just for reference - are the sources we currently use in our kernel:
>>> attached.
>>>
>> Hm, I see. Putting a call out to a pv-ops function in the ticket lock
>> slow path looks pretty straightforward. The need for an extra lock on
>> the contended unlock side is a bit unfortunate; have you measured to see
>> what hit that has? Seems to me like you could avoid the problem by
>> using per-cpu storage rather than stack storage (though you'd need to
>> copy the per-cpu data to stack when handling a nested spinlock).
>>
>
> Not sure how you'd imagine this to work: The unlock code has to look at all
> cpus' data in either case, so an inner lock would still be required imo.
>

Well, rather than an explicit lock I was thinking of an optimistic
scheme like the pv clock update algorithm.

"struct spinning" would have a version counter it could update using the
even=stable/odd=unstable algorithm. The lock side would simply save the
current per-cpu "struct spinning" state onto its own stack (assuming you
can even have nested spinning), and then update the per-cpu copy with
the current lock. The kick side can check the version counter to make
sure it gets a consistent snapshot of the target cpu's current lock state.

I think that only the locking side requires locked instructions, and the
kick side can be all unlocked.
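
A user-space sketch of the even/odd version scheme, written in C11 for illustration only. The layout of "struct spinning" here (a lock pointer stored as an integer plus a ticket) is an assumption, as are the function names. It also assumes exactly one writer per structure (the owning cpu) and uses default sequentially consistent atomics throughout rather than hand-tuned barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct spinning {
	atomic_uint version;		/* even = stable, odd = being updated */
	_Atomic(uintptr_t) lock;	/* lock being spun on, 0 if none */
	atomic_uint ticket;
};

static void spinning_init(struct spinning *s)
{
	atomic_init(&s->version, 0);
	atomic_init(&s->lock, 0);
	atomic_init(&s->ticket, 0);
}

/* Lock side: single writer, brackets the update with version bumps. */
static void spinning_publish(struct spinning *s, uintptr_t lock,
			     unsigned int ticket)
{
	unsigned int v = atomic_load(&s->version);

	assert((v & 1) == 0);			/* no update in flight */
	atomic_store(&s->version, v + 1);	/* odd: unstable */
	atomic_store(&s->lock, lock);
	atomic_store(&s->ticket, ticket);
	atomic_store(&s->version, v + 2);	/* even: stable again */
}

/* Kick side: retry until a consistent snapshot has been read. */
static void spinning_snapshot(struct spinning *s, uintptr_t *lock,
			      unsigned int *ticket)
{
	unsigned int before, after;

	assert(lock != NULL && ticket != NULL);
	do {
		before = atomic_load(&s->version);
		*lock = atomic_load(&s->lock);
		*ticket = atomic_load(&s->ticket);
		after = atomic_load(&s->version);
	} while ((before & 1) || before != after);
}
```

The reader never writes to the structure; it only re-reads when it observes an odd version or a version change across its loads.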

>> What's the thinking behind the xen_spin_adjust() stuff?
>>
>
> That's the placeholder for implementing interrupt re-enabling in the irq-save
> lock path. The requirement is that if a nested lock attempt hits the same
> lock on the same cpu that it failed to get acquired on earlier (but got a ticket
> already), tickets for the given (lock, cpu) pair need to be circularly shifted
> around so that the innermost requestor gets the earliest ticket. This is what
> that function's job will become if I ever get to implement this.
>

Sounds fiddly.

>>> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
>>> {
>>>     unsigned int token, count;
>>>     bool free;
>>>
>>>     __ticket_spin_lock_preamble;
>>>     if (unlikely(!free))
>>>         token = xen_spin_adjust(lock, token);
>>>     do {
>>>         count = 1 << 10;
>>>         __ticket_spin_lock_body;
>>>     } while (unlikely(!count) && !xen_spin_wait(lock, token));
>>> }
>>>
>> How does this work? Doesn't it always go into the slowpath loop even if
>> the preamble got the lock with no contention?
>>
>
> It indeed always enters the slowpath loop, but only for a single pass through
> part of its body (the first compare in the body macro will make it exit the loop
> right away: 'token' is not only the ticket here, but the full lock->slock
> contents). But yes, I think you're right, one could avoid entering the body
> altogether by moving the containing loop into the if(!free) body. The logic
> went through a number of re-writes, so I must have overlooked that
> opportunity on the last round of adjustments.
>

I was thinking of something like this: (completely untested)

void __ticket_spin_lock(raw_spinlock_t *lock)
{
    unsigned short inc = 0x100;
    unsigned short token;

    do {
        unsigned count = 1 << 10;
        asm volatile (
            "   lock xaddw %w0, %1\n"
            "1: cmpb %h0, %b0\n"
            "   je 2f\n"
            "   rep ; nop\n"
            "   movb %1, %b0\n"
            /* don't need lfence here, because loads are in-order */
            "   sub $1,%2\n"
            "   jnz 1b\n"
            "2:"
            : "+Q" (inc), "+m" (lock->slock), "+r" (count)
            :
            : "memory", "cc");

        if (likely(count != 0))
            break;

        token = inc;
        inc = 0;
    } while (unlikely(!xen_spin_wait(lock, token)));
}

where xen_spin_wait() would actually be a pvops call, and perhaps the
asm could be pulled out into an inline to deal with the 8/16 bit ticket
difference.
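
The control flow of that loop (spin a bounded number of times, then fall back to a slow path) can be sketched in portable C11. This is not the asm above: it takes the ticket once instead of re-running the xadd, it keeps head and tail in separate bytes, and the callback type spin_wait_fn merely stands in for the pvops xen_spin_wait() hook. All names here are illustrative assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_THRESHOLD (1u << 10)   /* same bound as "count" above */

struct ticket_lock {
    atomic_uchar head;   /* ticket currently being served */
    atomic_uchar tail;   /* next ticket to hand out */
};

/* Slow-path hook: returns true if it acquired the lock on our behalf. */
typedef bool (*spin_wait_fn)(struct ticket_lock *lock, unsigned char ticket);

/* Native stand-in: never blocks, so the caller simply keeps spinning. */
static bool spin_wait_none(struct ticket_lock *lock, unsigned char ticket)
{
    (void)lock;
    (void)ticket;
    return false;
}

static void ticket_lock_acquire(struct ticket_lock *lock, spin_wait_fn wait)
{
    /* Take a ticket exactly once. */
    unsigned char me =
        atomic_fetch_add_explicit(&lock->tail, 1, memory_order_relaxed);

    for (;;) {
        /* Fast path: bounded spin until our ticket is served. */
        for (unsigned count = SPIN_THRESHOLD; count != 0; count--) {
            if (atomic_load_explicit(&lock->head,
                                     memory_order_acquire) == me)
                return;
        }
        /* Spun too long: hand over to the slow path. */
        if (wait(lock, me))
            return;
    }
}

static void ticket_lock_release(struct ticket_lock *lock)
{
    /* Serve the next ticket. */
    atomic_fetch_add_explicit(&lock->head, 1, memory_order_release);
}
```

In the uncontended case the first load already matches the ticket, so the hook is never called; only a waiter that exhausts the spin budget pays for the indirect call.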

J

Jeremy Fitzhardinge

May 21, 2009, 6:50:07 PM

Chuck Ebbert wrote:

> On Wed, 13 May 2009 17:16:55 -0700
> Jeremy Fitzhardinge <jer...@goop.org> wrote:
>
>
>> Paravirt patching turns all the pvops calls into direct calls, so
>> _spin_lock etc do end up having direct calls. For example, the compiler
>> generated code for paravirtualized _spin_lock is:
>>
>> <_spin_lock+0>: mov %gs:0xb4c8,%rax
>> <_spin_lock+9>: incl 0xffffffffffffe044(%rax)
>> <_spin_lock+15>: callq *0xffffffff805a5b30
>> <_spin_lock+22>: retq
>>
>> The indirect call will get patched to:
>> <_spin_lock+0>: mov %gs:0xb4c8,%rax
>> <_spin_lock+9>: incl 0xffffffffffffe044(%rax)
>> <_spin_lock+15>: callq <__ticket_spin_lock>
>> <_spin_lock+20>: nop; nop /* or whatever 2-byte nop */
>> <_spin_lock+22>: retq
>>
>>
>

> Can't those calls be changed to jumps?
>

In this specific instance of this example, yes. But if you start
enabling various spinlock debug options then there'll be code following
the call. It would be hard for the runtime patching machinery to know
when it would be safe to do the substitution.

J

Chuck Ebbert

May 21, 2009, 6:50:09 PM

On Wed, 13 May 2009 17:16:55 -0700
Jeremy Fitzhardinge <jer...@goop.org> wrote:

> Paravirt patching turns all the pvops calls into direct calls, so
> _spin_lock etc do end up having direct calls. For example, the compiler
> generated code for paravirtualized _spin_lock is:
>
> <_spin_lock+0>: mov %gs:0xb4c8,%rax
> <_spin_lock+9>: incl 0xffffffffffffe044(%rax)
> <_spin_lock+15>: callq *0xffffffff805a5b30
> <_spin_lock+22>: retq
>
> The indirect call will get patched to:
> <_spin_lock+0>: mov %gs:0xb4c8,%rax
> <_spin_lock+9>: incl 0xffffffffffffe044(%rax)
> <_spin_lock+15>: callq <__ticket_spin_lock>
> <_spin_lock+20>: nop; nop /* or whatever 2-byte nop */
> <_spin_lock+22>: retq
>

Can't those calls be changed to jumps?

H. Peter Anvin

May 21, 2009, 7:20:12 PM

Jeremy Fitzhardinge wrote:
>
> In this specific instance of this example, yes. But if you start
> enabling various spinlock debug options then there'll be code following
> the call. It would be hard for the runtime patching machinery to know
> when it would be safe to do the substitution.
>

"When it is immediately followed by a ret" seems like a straightforward
rule to me?

-hpa
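
That rule is easy to state over raw bytes. A minimal sketch, assuming the 7-byte "callq *addr" slot from the listing quoted earlier in the thread and a hypothetical helper named patch_call_site(). The kernel's real patching code is not shown in this thread and will differ; plain 0x90 nops are used here in place of a proper multi-byte nop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    OP_CALL_REL32 = 0xE8,   /* call rel32 */
    OP_JMP_REL32  = 0xE9,   /* jmp rel32 */
    OP_RET        = 0xC3,   /* ret */
    OP_NOP        = 0x90,   /* one-byte nop */
    DIRECT_LEN    = 5       /* opcode + rel32 */
};

/*
 * Rewrite the indirect-call slot at `site` (slot_len bytes) as a direct
 * transfer to rel32 (relative to the end of the 5-byte instruction).
 * If the byte right after the slot is a ret, emit a jmp (tail call);
 * otherwise emit a call. The rest of the slot is padded with nops.
 * The caller must guarantee that site[slot_len] is readable.
 * Returns true if the tail-call form was used.
 */
static bool patch_call_site(uint8_t *site, size_t slot_len, int32_t rel32)
{
    assert(slot_len >= (size_t)DIRECT_LEN);

    bool tail = (site[slot_len] == OP_RET);

    site[0] = tail ? OP_JMP_REL32 : OP_CALL_REL32;
    memcpy(&site[1], &rel32, sizeof rel32);   /* host byte order */
    memset(&site[DIRECT_LEN], OP_NOP, slot_len - DIRECT_LEN);
    return tail;
}
```

The same rel32 works for both forms because call and jmp are both 5 bytes and both measure the displacement from the end of the instruction.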

Xin, Xiaohui

May 21, 2009, 9:30:12 PM

Remember, we did one experiment with "jump"; the result seems to show that the overhead is even higher than with the call.

Thanks
Xiaohui

H. Peter Anvin

May 21, 2009, 11:40:10 PM

Xin, Xiaohui wrote:
> Remember, we did one experiment with "jump"; the result seems to show that the overhead is even higher than with the call.

I didn't, no. That seems extremely weird to me.

(Unbalancing the call/ret stack is known to suck royally, of course.)

>>>
>>>
>> Can't those calls be changed to jumps?
>>
>
> In this specific instance of this example, yes. But if you start
> enabling various spinlock debug options then there'll be code following
> the call. It would be hard for the runtime patching machinery to know
> when it would be safe to do the substitution.
>

When there is code after the call, it's rather obviously not safe.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Jeremy Fitzhardinge

May 22, 2009, 12:30:07 AM

Xin, Xiaohui wrote:
> Remember, we did one experiment with "jump"; the result seems to show that the overhead is even higher than with the call.
>

I don't think you had mentioned that. You're saying that a
call->jmp->ret is slower than call->call->ret->ret?

Xin, Xiaohui

May 22, 2009, 2:10:13 AM

What I mean is that if the binary of _spin_lock is like this:
(gdb) disassemble _spin_lock
Dump of assembler code for function _spin_lock:
0xffffffff80497c0f <_spin_lock+0>: mov 1252634(%rip),%r11 # #0xffffffff805c9930 <test_lock_ops+16>
0xffffffff80497c16 <_spin_lock+7>: jmpq *%r11
End of assembler dump.
(gdb) disassemble

In this situation, where the binary contains a jump, the overhead is higher than with the call.

Thanks
Xiaohui

-----Original Message-----
From: Jeremy Fitzhardinge [mailto:jer...@goop.org]
Sent: May 22, 2009 12:28
To: Xin, Xiaohui
Cc: Chuck Ebbert; Ingo Molnar; Li, Xin; Nakajima, Jun; H. Peter Anvin; Nick Piggin; Linux Kernel Mailing List; Xen-devel
Subject: Re: Performance overhead of paravirt_ops on native identified

H. Peter Anvin

May 22, 2009, 12:40:11 PM

Xin, Xiaohui wrote:
> What I mean is that if the binary of _spin_lock is like this:
> (gdb) disassemble _spin_lock
> Dump of assembler code for function _spin_lock:
> 0xffffffff80497c0f <_spin_lock+0>: mov 1252634(%rip),%r11 # #0xffffffff805c9930 <test_lock_ops+16>
> 0xffffffff80497c16 <_spin_lock+7>: jmpq *%r11
> End of assembler dump.
> (gdb) disassemble
>
> In this situation, where the binary contains a jump, the overhead is higher than with the call.
>

That's an indirect jump, though. I don't think anyone was suggesting
using an indirect jump; the final patched version should be a direct
jump (instead of a direct call.)

I can see how indirect jumps might be slower, since they are probably
not optimized as aggressively in hardware as indirect calls -- indirect
jumps are generally used for switch tables, which often have low
predictability, whereas indirect calls are generally used for method
calls, which are (a) incredibly important for OOP languages, and (b)
generally highly predictable on the dynamic scale.

However, direct jumps and calls don't need prediction at all (although
of course rets do.)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

H. Peter Anvin

May 22, 2009, 6:50:07 PM

Jeremy Fitzhardinge wrote:
>
> I did a quick experiment to see how many sites this optimisation could
> actually affect. Firstly, it does absolutely nothing with frame
> pointers enabled. Arranging for no frame pointers is quite tricky,
> since it means disabling all debugging, tracing and other things.
>
> With no frame pointers, about 26 of 5400 indirect calls are
> immediately followed by ret (not all of those sites are pvops calls).
> With preempt disabled, this goes up to 45 sites.
>
> I haven't done any actual runtime tests, but a quick survey of the
> affected sites shows that only a couple are performance-sensitive;
> _spin_lock and _spin_lock_irq and _spin_lock_irqsave are the most obvious.
>

OK, that doesn't seem like a very productive avenue. Problem still
remains, obviously.

-hpa

Jeremy Fitzhardinge

May 22, 2009, 6:50:13 PM

H. Peter Anvin wrote:
> That's an indirect jump, though. I don't think anyone was suggesting
> using an indirect jump; the final patched version should be a direct
> jump (instead of a direct call.)
>
> I can see how indirect jumps might be slower, since they are probably
> not optimized as aggressively in hardware as indirect calls -- indirect
> jumps are generally used for switch tables, which often have low
> predictability, whereas indirect calls are generally used for method
> calls, which are (a) incredibly important for OOP languages, and (b)
> generally highly predictable on the dynamic scale.
>
> However, direct jumps and calls don't need prediction at all (although
> of course rets do.)

I did a quick experiment to see how many sites this optimisation could
actually affect. Firstly, it does absolutely nothing with frame
pointers enabled. Arranging for no frame pointers is quite tricky,
since it means disabling all debugging, tracing and other things.

With no frame pointers, about 26 of 5400 indirect calls are
immediately followed by ret (not all of those sites are pvops calls).
With preempt disabled, this goes up to 45 sites.

I haven't done any actual runtime tests, but a quick survey of the
affected sites shows that only a couple are performance-sensitive;
_spin_lock and _spin_lock_irq and _spin_lock_irqsave are the most obvious.

J

Ingo Molnar

May 25, 2009, 5:20:10 AM

I did more 'perf stat mmap-perf 1' measurements (bound to a single
core, running single thread - to exclude cross-CPU noise), which in
essence measures CONFIG_PARAVIRT=y overhead on native kernels:

Performance counter stats for './mmap-perf':

      [vanilla]   [PARAVIRT=y]

    1230.805297    1242.828348   task clock ticks (msecs)   + 0.97%
     3602663413     3637329004   CPU cycles (events)        + 0.96%
     1927074043     1958330813   instructions (events)      + 1.62%

That's around 1% on really fast hardware (Core2 E6800 @ 2.93 GHz,
4MB L2 cache), i.e. still significant overhead. Distros generally
enable CONFIG_PARAVIRT, even though a large majority of users never
actually runs them as Xen guests.
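
The percentages in the table can be re-derived from the raw counters, and the same counters also give cycles per instruction. A small sketch follows; the struct and helper names are made up for illustration, and the constants are copied from the table above.

```c
#include <assert.h>

/* One column of the 'perf stat' table above. */
struct perf_sample {
    double task_clock_ms;
    double cycles;
    double instructions;
};

static const struct perf_sample vanilla  = { 1230.805297, 3602663413.0, 1927074043.0 };
static const struct perf_sample paravirt = { 1242.828348, 3637329004.0, 1958330813.0 };

/* Relative change of `after` versus `before`, in percent. */
static double pct_delta(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Cycles per instruction for one sample. */
static double cpi(const struct perf_sample *s)
{
    assert(s->instructions > 0.0);
    return s->cycles / s->instructions;
}
```

Calling pct_delta(vanilla.instructions, paravirt.instructions) reproduces the 1.62% figure. On these particular numbers cpi(&paravirt) comes out slightly below cpi(&vanilla), so here the extra time tracks the extra instructions rather than a worse CPI.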

Are there plans to analyze and fix this overhead too, beyond the
paravirt-spinlocks overhead you analyzed? (Note that i had
CONFIG_PARAVIRT_SPINLOCKS disabled in this test.)

I think only those users should get overhead who actually run such
kernels in a virtualized environment.

I cannot cite a single other kernel feature that has so much
performance impact when runtime-disabled. For example, an often
cited bloat and overhead source is CONFIG_SECURITY=y.

Its runtime overhead (same system, same workload) is:

      [vanilla]   [SECURITY=y]

    1219.652255    1230.805297   task clock ticks (msecs)   + 0.91%
     3574548461     3602663413   CPU cycles (events)        + 0.78%
     1915177924     1927074043   instructions (events)      + 0.62%

( With the difference that the distros that enable CONFIG_SECURITY=y
tend to install and use at least one security module by default. )

So everyone who runs a CONFIG_PARAVIRT=y distro kernel has 1% of
overhead in this mmap-test workload - even if no Xen is used on that
box, ever.

Config attached.

Ingo

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.30-rc7
# Mon May 25 10:52:09 2009
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_TASKSTATS is not set
# CONFIG_AUDIT is not set

#
# RCU Subsystem
#
CONFIG_CLASSIC_RCU=y
# CONFIG_TREE_RCU is not set
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set
# CONFIG_CGROUPS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_NET_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
CONFIG_KALLSYMS_EXTRA_PASS=y
# CONFIG_STRIP_ASM_SYMS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_HAVE_PERF_COUNTERS=y

#
# Performance Counters
#
CONFIG_PERF_COUNTERS=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
# CONFIG_MARKERS is not set
CONFIG_OPROFILE=y
# CONFIG_OPROFILE_IBS is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_API_DEBUG=y
# CONFIG_SLOW_WORK is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
# CONFIG_MODULES is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG=y
# CONFIG_BLK_DEV_INTEGRITY is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
# CONFIG_SPARSE_IRQ is not set
CONFIG_X86_MPPARSE=y
# CONFIG_X86_EXTENDED_PLATFORM is not set
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_XEN=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=32
CONFIG_XEN_SAVE_RESTORE=y
# CONFIG_XEN_DEBUG_FS is not set
# CONFIG_KVM_CLOCK is not set
# CONFIG_KVM_GUEST is not set
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
CONFIG_MCORE2=y
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_P6_NOP=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_X86_DS=y
CONFIG_X86_PTRACE_BTS=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
# CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
# CONFIG_AMD_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_IOMMU_API is not set
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=16
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_I8K is not set
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
# CONFIG_X86_CPU_DEBUG is not set
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
CONFIG_HAVE_MLOCK=y
CONFIG_HAVE_MLOCKED_PAGE_BIT=y
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
# CONFIG_EFI is not set
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_SCHED_HRTICK=y
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
# CONFIG_SUSPEND is not set
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_SYSFS_POWER=y
# CONFIG_ACPI_PROC_EVENT is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_SBS is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_SPEEDSTEP_CENTRINO=y
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
# CONFIG_DMAR is not set
# CONFIG_INTR_REMAP is not set
CONFIG_PCIEPORTBUS=y
# CONFIG_HOTPLUG_PCI_PCIE is not set
CONFIG_PCIEAER=y
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_PCI_LEGACY=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
# CONFIG_HT_IRQ is not set
# CONFIG_PCI_IOV is not set
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
# CONFIG_PCCARD is not set
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_FAKE is not set
# CONFIG_HOTPLUG_PCI_ACPI is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
CONFIG_INET_TUNNEL=y
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_LRO is not set
# CONFIG_INET_DIAG is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=y
CONFIG_IPV6_NDISC_NODETYPE=y
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
# CONFIG_NETLABEL is not set
CONFIG_NETWORK_SECMARK=y
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
# CONFIG_NETFILTER_ADVANCED is not set

#
# Core Netfilter Configuration
#
# CONFIG_NETFILTER_NETLINK_LOG is not set
CONFIG_NF_CONNTRACK=y
# CONFIG_NF_CONNTRACK_SECMARK is not set
CONFIG_NF_CONNTRACK_FTP=y
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CT_NETLINK is not set
CONFIG_NETFILTER_XTABLES=y
CONFIG_NETFILTER_XT_TARGET_MARK=y
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
CONFIG_NETFILTER_XT_TARGET_SECMARK=y
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y
CONFIG_NETFILTER_XT_MATCH_MARK=y
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
CONFIG_NETFILTER_XT_MATCH_STATE=y
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=y
CONFIG_NF_CONNTRACK_IPV4=y
CONFIG_NF_CONNTRACK_PROC_COMPAT=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_TARGET_LOG=y
CONFIG_IP_NF_TARGET_ULOG=y
CONFIG_NF_NAT=y
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_NF_NAT_FTP=y
# CONFIG_NF_NAT_IRC is not set
# CONFIG_NF_NAT_TFTP is not set
# CONFIG_NF_NAT_AMANDA is not set
# CONFIG_NF_NAT_PPTP is not set
# CONFIG_NF_NAT_H323 is not set
# CONFIG_NF_NAT_SIP is not set
CONFIG_IP_NF_MANGLE=y

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_CONNTRACK_IPV6=y
CONFIG_IP6_NF_IPTABLES=y
CONFIG_IP6_NF_MATCH_IPV6HEADER=y
CONFIG_IP6_NF_TARGET_LOG=y
CONFIG_IP6_NF_FILTER=y
CONFIG_IP6_NF_TARGET_REJECT=y
# CONFIG_IP6_NF_MANGLE is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=y
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_INGRESS is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=y
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_IPT is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_DROP_MONITOR is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
CONFIG_SYS_HYPERVISOR=y
# CONFIG_CONNECTOR is not set
# CONFIG_MTD is not set
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
CONFIG_BLK_CPQ_DA=y
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
# CONFIG_BLK_DEV_LOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
CONFIG_XEN_BLKDEV_FRONTEND=y
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_ISL29003 is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_93CX6 is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_TGT is not set
CONFIG_SCSI_NETLINK=y
# CONFIG_SCSI_PROC_FS is not set

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=y
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=5000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_LIBFC is not set
# CONFIG_LIBFCOE is not set
# CONFIG_FCOE is not set
# CONFIG_FCOE_FNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_SRP is not set
# CONFIG_SCSI_DH is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
CONFIG_SATA_NV=y
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_PATA_ACPI is not set
# CONFIG_PATA_ALI is not set
CONFIG_PATA_AMD=y
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
CONFIG_PATA_OLDPIIX=y
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set
# CONFIG_PATA_SCH is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
# CONFIG_DM_CRYPT is not set
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_MIRROR=y
CONFIG_DM_ZERO=y
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
# CONFIG_FIREWIRE is not set
# CONFIG_IEEE1394 is not set
# CONFIG_I2O is not set
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_COMPAT_NET_DEV_OPS=y
# CONFIG_IFB is not set
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set
# CONFIG_VETH is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_FIXED_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
CONFIG_NET_VENDOR_3COM=y
CONFIG_VORTEX=y
# CONFIG_TYPHOON is not set
# CONFIG_ETHOC is not set
# CONFIG_DNET is not set
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_B44 is not set
CONFIG_FORCEDETH=y
# CONFIG_FORCEDETH_NAPI is not set
CONFIG_E100=y
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
CONFIG_8139TOO=y
CONFIG_8139TOO_PIO=y
# CONFIG_8139TOO_TUNE_TWISTER is not set
# CONFIG_8139TOO_8129 is not set
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R6040 is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_SC92031 is not set
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
CONFIG_E1000=y
CONFIG_E1000E=y
# CONFIG_IP1000 is not set
CONFIG_IGB=y
# CONFIG_IGBVF is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_TIGON3=y
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_JME is not set
CONFIG_NETDEV_10000=y
# CONFIG_CHELSIO_T1 is not set
CONFIG_CHELSIO_T3_DEPENDS=y
# CONFIG_CHELSIO_T3 is not set
# CONFIG_ENIC is not set
# CONFIG_IXGBE is not set
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NIU is not set
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_TEHUTI is not set
# CONFIG_BNX2X is not set
# CONFIG_QLGE is not set
# CONFIG_SFC is not set
# CONFIG_BE2NET is not set
CONFIG_TR=y
# CONFIG_IBMOL is not set
# CONFIG_3C359 is not set
# CONFIG_TMS380TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_WAN is not set
CONFIG_XEN_NETDEV_FRONTEND=y
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
# CONFIG_SKFP is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
CONFIG_NETCONSOLE=y
# CONFIG_NETCONSOLE_DYNAMIC is not set
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set
CONFIG_XEN_KBDDEV_FRONTEND=y

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_VSXXXAA is not set
CONFIG_INPUT_JOYSTICK=y
# CONFIG_JOYSTICK_ANALOG is not set
# CONFIG_JOYSTICK_A3D is not set
# CONFIG_JOYSTICK_ADI is not set
# CONFIG_JOYSTICK_COBRA is not set
# CONFIG_JOYSTICK_GF2K is not set
# CONFIG_JOYSTICK_GRIP is not set
# CONFIG_JOYSTICK_GRIP_MP is not set
# CONFIG_JOYSTICK_GUILLEMOT is not set
# CONFIG_JOYSTICK_INTERACT is not set
# CONFIG_JOYSTICK_SIDEWINDER is not set
# CONFIG_JOYSTICK_TMDC is not set
# CONFIG_JOYSTICK_IFORCE is not set
# CONFIG_JOYSTICK_WARRIOR is not set
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
# CONFIG_JOYSTICK_SPACEBALL is not set
# CONFIG_JOYSTICK_STINGER is not set
# CONFIG_JOYSTICK_TWIDJOY is not set
# CONFIG_JOYSTICK_ZHENHUA is not set
# CONFIG_JOYSTICK_JOYDUMP is not set
# CONFIG_JOYSTICK_XPAD is not set
# CONFIG_INPUT_TABLET is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_AD7879_I2C is not set
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_WM97XX is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
CONFIG_DEVKMEM=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_N_HDLC is not set
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
# CONFIG_SX is not set
# CONFIG_RIO is not set
CONFIG_STALDRV=y
# CONFIG_STALLION is not set
# CONFIG_ISTALLION is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
# CONFIG_LEGACY_PTYS is not set
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IRQ=y
CONFIG_HVC_XEN=y
# CONFIG_IPMI_HANDLER is not set
# CONFIG_HW_RANDOM is not set
# CONFIG_NVRAM is not set
CONFIG_RTC=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
# CONFIG_I2C_CHARDEV is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_SIMTEC is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Graphics adapter I2C/DDC channel drivers
#
# CONFIG_I2C_VOODOO3 is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_PCA_PLATFORM is not set

#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_PCF8575 is not set
# CONFIG_SENSORS_PCA9539 is not set
# CONFIG_SENSORS_MAX6875 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
# CONFIG_SPI is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_BQ27x00 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7473 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_FSCPOS is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_HWMON is not set
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_SC520_WDT is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83697HF_WDT is not set
# CONFIG_W83697UG_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_REGULATOR is not set

#
# Multimedia devices
#

#
# Multimedia core support
#
# CONFIG_VIDEO_DEV is not set
# CONFIG_DVB_CORE is not set
# CONFIG_VIDEO_MEDIA is not set

#
# Multimedia drivers
#
# CONFIG_DAB is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_VIA is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
CONFIG_DRM_I915=y
# CONFIG_DRM_I915_KMS is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
# CONFIG_VGASTATE is not set
# CONFIG_VIDEO_OUTPUT_CONTROL is not set
CONFIG_FB=y
CONFIG_FIRMWARE_EDID=y
# CONFIG_FB_DDC is not set
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
CONFIG_FB_DEFERRED_IO=y
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_VESA is not set
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set
CONFIG_XEN_FBDEV_FRONTEND=y
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
# CONFIG_BACKLIGHT_CLASS_DEVICE is not set

#
# Display device support
#
CONFIG_DISPLAY_SUPPORT=y

#
# Display hardware drivers
#

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
CONFIG_SOUND=y
CONFIG_SOUND_OSS_CORE=y
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
CONFIG_SND_JACK=y
CONFIG_SND_SEQUENCER=y
# CONFIG_SND_SEQ_DUMMY is not set
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=y
CONFIG_SND_PCM_OSS=y
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_HRTIMER=y
CONFIG_SND_SEQ_HRTIMER_DEFAULT=y
CONFIG_SND_RTCTIMER=y
# CONFIG_SND_DYNAMIC_MINORS is not set
CONFIG_SND_SUPPORT_OLD_API=y
# CONFIG_SND_VERBOSE_PROCFS is not set
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_AC97_CODEC=y
CONFIG_SND_DRIVERS=y
# CONFIG_SND_PCSP is not set
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_VIRMIDI is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
# CONFIG_SND_AC97_POWER_SAVE is not set
CONFIG_SND_PCI=y
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AW2 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_OXYGEN is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5530 is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_INDIGOIOX is not set
# CONFIG_SND_INDIGODJX is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
# CONFIG_SND_HDA_HWDEP is not set
CONFIG_SND_HDA_INPUT_BEEP=y
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_ATIHDMI=y
# CONFIG_SND_HDA_CODEC_NVHDMI is not set
CONFIG_SND_HDA_CODEC_INTELHDMI=y
CONFIG_SND_HDA_ELD=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
# CONFIG_SND_HDA_POWER_SAVE is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_HIFIER is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=y
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VIRTUOSO is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
CONFIG_SND_USB=y
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
# CONFIG_SND_SOC is not set
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=y
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_DEBUG is not set
# CONFIG_HIDRAW is not set

#
# USB Input Devices
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
CONFIG_HID_APPLE=y
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
# CONFIG_DRAGONRISE_FF is not set
CONFIG_HID_EZKEY=y
CONFIG_HID_KYE=y
CONFIG_HID_GYRATION=y
CONFIG_HID_KENSINGTON=y
CONFIG_HID_LOGITECH=y
CONFIG_LOGITECH_FF=y
# CONFIG_LOGIRUMBLEPAD2_FF is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
CONFIG_HID_NTRIG=y
CONFIG_HID_PANTHERLORD=y
# CONFIG_PANTHERLORD_FF is not set
CONFIG_HID_PETALYNX=y
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
CONFIG_HID_SUNPLUS=y
# CONFIG_GREENASIA_FF is not set
CONFIG_HID_TOPSEED=y
CONFIG_THRUSTMASTER_FF=y
# CONFIG_ZEROPLUS_FF is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_DEVICE_CLASS is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HWA_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=y
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_VST is not set

#
# OTG and related infrastructure
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_MM_EDAC=y
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_EDAC_AMD8131 is not set
# CONFIG_EDAC_AMD8111 is not set
# CONFIG_RTC_CLASS is not set
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
# CONFIG_UIO is not set
CONFIG_XEN_BALLOON=y
CONFIG_XEN_SCRUB_PAGES=y
CONFIG_XEN_DEV_EVTCHN=y
CONFIG_XENFS=y
CONFIG_XEN_COMPAT_XENFS=y
CONFIG_XEN_SYS_HYPERVISOR=y
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_DELL_WMI is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
CONFIG_ACPI_WMI=y
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_TOSHIBA is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
CONFIG_BTRFS_FS=y
CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set
CONFIG_GENERIC_ACL=y

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
# CONFIG_CONFIGFS_FS is not set
# CONFIG_MISC_FILESYSTEMS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_SMB_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
# CONFIG_KARMA_PARTITION is not set
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=y
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ALLOW_WARNINGS=y
# CONFIG_ENABLE_WARN_DEPRECATED is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
# CONFIG_DETECT_SOFTLOCKUP is not set
# CONFIG_DETECT_HUNG_TASK is not set
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_SYSCTL_SYSCALL_CHECK=y
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_HW_BRANCH_TRACER=y
CONFIG_HAVE_FTRACE_SYSCALLS=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_TRACING=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SYSPROF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_CONTEXT_SWITCH_TRACER is not set
# CONFIG_ENABLE_EVENT_TRACING is not set
# CONFIG_FTRACE_SYSCALLS is not set
# CONFIG_BOOT_TRACER is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
# CONFIG_POWER_TRACER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_HW_BRANCH_TRACER is not set
# CONFIG_KMEMTRACE is not set
# CONFIG_WORKQUEUE_TRACER is not set
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_HAVE_ARCH_KMEMCHECK=y
CONFIG_STRICT_DEVMEM=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_IOMMU_DEBUG is not set
CONFIG_X86_DS_SELFTEST=y
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_OPTIMIZE_INLINING is not set

#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y
CONFIG_SECURITY=y
# CONFIG_SECURITYFS is not set
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_NETWORK_XFRM is not set
# CONFIG_SECURITY_PATH is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set
# CONFIG_SECURITY_ROOTPLUG is not set
CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=0
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_IMA is not set
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_NULL is not set
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_AUTHENC is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
CONFIG_CRYPTO_PCBC=y
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=y
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set

#
# Ciphers
#
# CONFIG_CRYPTO_AES is not set
# CONFIG_CRYPTO_AES_X86_64 is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_ZLIB is not set
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_HIFN_795X is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
# CONFIG_VIRTUALIZATION is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
# CONFIG_CRC_CCITT is not set
# CONFIG_CRC16 is not set
CONFIG_CRC_T10DIF=y
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_NLATTR=y

Jeremy Fitzhardinge

May 26, 2009, 2:50:10 PM

Ingo Molnar wrote:
> I did more 'perf stat mmap-perf 1' measurements (bound to a single
> core, running single thread - to exclude cross-CPU noise), which in
> essence measures CONFIG_PARAVIRT=y overhead on native kernels:
>

Thanks for taking the time to make these measurements. You'll agree
they're much better numbers than the last time you ran these tests?

> Performance counter stats for './mmap-perf':
>
>     [vanilla]   [PARAVIRT=y]
>
>   1230.805297    1242.828348   task clock ticks (msecs)   + 0.97%
>    3602663413     3637329004   CPU cycles (events)        + 0.96%
>    1927074043     1958330813   instructions (events)      + 1.62%
>
> That's around 1% on really fast hardware (Core2 E6800 @ 2.93 GHz,
> 4MB L2 cache), i.e. still significant overhead. Distros generally
> enable CONFIG_PARAVIRT, even though a large majority of users never
> actually runs them as Xen guests.
>
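
The percentages in the quoted table can be rechecked from the raw
counters. A minimal sketch in Python, using only the six values quoted
above; the dictionary and function names are illustrative. It also
derives cycles per instruction (CPI), which the table does not list.

```python
# Counter values copied from the quoted 'perf stat' table.
vanilla = {
    "task_clock_ms": 1230.805297,
    "cycles": 3602663413,
    "instructions": 1927074043,
}
paravirt = {
    "task_clock_ms": 1242.828348,
    "cycles": 3637329004,
    "instructions": 1958330813,
}


def pct_increase(base, new):
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100.0


# Per-counter overhead of PARAVIRT=y relative to the vanilla run.
deltas = {name: pct_increase(vanilla[name], paravirt[name]) for name in vanilla}

# Cycles per instruction for each run, and its relative change.
cpi_vanilla = vanilla["cycles"] / vanilla["instructions"]
cpi_paravirt = paravirt["cycles"] / paravirt["instructions"]
cpi_delta = pct_increase(cpi_vanilla, cpi_paravirt)

for name, delta in deltas.items():
    print(f"{name:15s} {delta:+.2f}%")
print(f"{'CPI':15s} {cpi_delta:+.2f}%")
```

On these figures the instruction count grows faster than the cycle
count, so CPI does not rise with PARAVIRT=y; the extra time tracks the
extra instructions. The table gives one value per configuration, so
this arithmetic says nothing about run-to-run variation.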

Did you do only a single run, or is this the result of multiple runs?
If so, what was your procedure? How did you control for page
placement/cache effects/other boot-to-boot variations?

Your numbers are not dissimilar to my measurements, but I also saw up to
1% performance improvement vs native from boot to boot (I saw up to 10%
reduction of cache misses with pvops, possibly because of its
de-inlining effects).

I also saw about 1% boot to boot variation with the non-pvops kernel.

While I think pvops does add *some* overhead, I think the absolute
magnitude is swamped in the noise. The best we can say is "somewhere
under 1% on modern hardware".
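
One way to check whether a ~1% difference stands out from ~1%
boot-to-boot variation is to repeat the measurement over several boots
per kernel and compare the difference in means with its standard error.
A rough sketch under stated assumptions: the run times below are
made-up placeholders, NOT measurements from this thread, and the
2-standard-error threshold is just a common rule of thumb.

```python
import statistics


def compare_runs(baseline, candidate):
    """Return (difference in means, standard error of it), both in % of baseline."""
    mean_b = statistics.mean(baseline)
    mean_c = statistics.mean(candidate)
    diff_pct = (mean_c - mean_b) / mean_b * 100.0
    # Standard error of the difference of two independent sample means.
    std_err = (statistics.variance(baseline) / len(baseline)
               + statistics.variance(candidate) / len(candidate)) ** 0.5
    noise_pct = std_err / mean_b * 100.0
    return diff_pct, noise_pct


# Placeholder task-clock values in msecs, one per boot (hypothetical data).
baseline_runs = [1225.0, 1236.0, 1231.0, 1219.0, 1240.0]
candidate_runs = [1238.0, 1229.0, 1247.0, 1233.0, 1244.0]

diff_pct, noise_pct = compare_runs(baseline_runs, candidate_runs)
# Treat the difference as real only if it clearly exceeds the noise.
distinguishable = abs(diff_pct) > 2.0 * noise_pct

print(f"difference in means: {diff_pct:+.2f}%")
print(f"standard error:      {noise_pct:.2f}%")
print(f"distinguishable:     {distinguishable}")
```

With spreads of this size a sub-1% shift in the mean is not separable
from noise after five boots per kernel; more runs, or tighter control
over boot-to-boot effects, would be needed to pin it down.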

> Are there plans to analyze and fix this overhead too, beyond the
> paravirt-spinlocks overhead you analyzed? (Note that i had
> CONFIG_PARAVIRT_SPINLOCKS disabled in this test.)
>
> I think only those users should get overhead who actually run such
> kernels in a virtualized environment.
>
> I cannot cite a single other kernel feature that has so much
> performance impact when runtime-disabled. For example, an often
> cited bloat and overhead source is CONFIG_SECURITY=y.
>

Your particular benchmark does many, many mmap/mprotect/munmap/mremap
calls, and takes a lot of pagefaults. That's going to hit the hot path
with lots of pte updates and so on, but very few security hooks. How
does it compare with a more balanced workload?

> Its runtime overhead (same system, same workload) is:
>
>     [vanilla]   [SECURITY=y]
>
>   1219.652255    1230.805297   task clock ticks (msecs)   + 0.91%
>    3574548461     3602663413   CPU cycles (events)        + 0.78%
>    1915177924     1927074043   instructions (events)      + 0.62%
>
> ( With the difference that the distros that enable CONFIG_SECURITY=y
> tend to install and use at least one security module by default. )
>
> So everyone who runs a CONFIG_PARAVIRT=y distro kernel has 1% of
> overhead in this mmap-test workload - even if no Xen is used on that
> box, ever.
>

So you're saying that:

* CONFIG_SECURITY adding +0.91% to wallclock time is OK, but pvops
adding +0.97% is not,
* your test is sensitive enough to make 0.06% difference
significant, and
* this benchmark is representative enough of real workloads that its
results are overall meaningful?
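
The 0.06% figure is the gap between the two task-clock overheads Ingo
quoted. A small sketch of that arithmetic, using only the task-clock
values from the two tables; the variable names are illustrative. Note
that the [SECURITY=y] column of the second table carries the same
values as the [vanilla] column of the first, so the PARAVIRT=y run
appears to have been measured on top of SECURITY=y.

```python
# Task clock (msecs), copied from the two quoted tables.
no_security = 1219.652255        # [vanilla] in the SECURITY table
security = 1230.805297           # [SECURITY=y]; also [vanilla] in the PARAVIRT table
security_paravirt = 1242.828348  # [PARAVIRT=y]


def pct_increase(base, new):
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100.0


security_overhead = pct_increase(no_security, security)
paravirt_overhead = pct_increase(security, security_paravirt)
# Gap between the two overheads, in percentage points.
gap = paravirt_overhead - security_overhead

print(f"SECURITY=y overhead: {security_overhead:+.2f}%")
print(f"PARAVIRT=y overhead: {paravirt_overhead:+.2f}%")
print(f"gap:                 {gap:+.2f} points")
```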

> Config attached.
>

Is this derived from a RH distro config?

J

Nick Piggin

May 28, 2009, 2:20:07 AM

FWIW, we had to disable paravirt in our default SLES11 kernel.
(admittedly this was before some of the recent improvements were
made). But there are only so many 1% performance regressions you
can introduce before customers won't upgrade (or vendors won't
publish benchmarks with the new software).

But OTOH, almost any big feature is going to cost performance. I don't
think this is something new (as noted with CONFIG_SECURITY). It is
just something people have to trade off and decide for themselves.
If you make it configurable and keep performance as good as reasonably
possible, then I don't think more can be asked.

If performance overhead is too much and/or too few users can take
advantage of a feature, then distros can always special-case it. I
think many did for PAE...

Jeremy Fitzhardinge

May 28, 2009, 5:00:16 PM

Nick Piggin wrote:
> FWIW, we had to disable paravirt in our default SLES11 kernel.
> (admittedly this was before some of the recent improvements were
> made).

Yes, I think you'll find it worth trying with it enabled again. The
spinlock thing clearly slowed things down, but when that's disabled the
difference to native is very small.

> But OTOH, almost any big feature is going to cost performance. I don't
> think this is something new (as noted with CONFIG_SECURITY). It is
> just something people have to trade off and decide for themselves.
> If you make it configurable and keep performance as good as reasonably
> possible, then I don't think more can be asked.
>

Yes, that's exactly my point. If I've worked on a feature, I clearly
want people to use that feature. Part of making it useful is to make
the distro/vendor/user decision to enable that feature as easy as
possible, by making the tradeoffs simple.

But tradeoffs are always going to cut both ways: positive (kernel
automatically works in a wider range of environments), and negative
(performance questions, complexity, etc). Ultimately it's the distro's
decision to enable a particular feature, and the distro's responsibility
to cope with the consequences of that.

> If performance overhead is too much and/or too few users can take
> advantage of a feature, then distros can always special-case it. I
> think many did for PAE...

I think that would be a clear sign of a problem. The whole point of
pvops is to avoid needing multiple kernel builds.

J

Ingo Molnar

May 30, 2009, 6:30:19 AM

* Nick Piggin <npi...@suse.de> wrote:

> FWIW, we had to disable paravirt in our default SLES11 kernel.
> (admittedly this was before some of the recent improvements were
> made). But there are only so many 1% performance regressions you
> can introduce before customers won't upgrade (or vendors won't
> publish benchmarks with the new software).
>
> But OTOH, almost any big feature is going to cost performance. I don't
> think this is something new (as noted with CONFIG_SECURITY). [...]

Yes in a way, but the difference is that:

- i noted CONFIG_SECURITY as the _worst current example_. It is the
largest-overhead feature known to me in this area, and i
benchmark the kernel a lot. CONFIG_PARAVIRT has _even more_
overhead so it takes the (dubious) top spot in this category.

- CONFIG_SECURITY, in the distros where it's enabled (most of them)
it is actually being relied on by the default user-space. It's
being configured and every default install of the distro has a
real (or at least perceived) advantage from it.

Not so with CONFIG_PARAVIRT. That feature is almost fully parasitic
to native environments: currently it brings no advantage on native
hardware _at all_ (and 95% of the users dont care about Xen).

Still it's impractical for a distro to disable it because adding a
separate kernel package has high costs too and PARAVIRT=y is needed
for the weird execution environment that Xen requires to run Linux
with acceptable speed.

It's as if we paid a full 1% overhead from the requirements of say
Centaur CPUs, on all other CPUs (Intel and AMD). That would be
unacceptable to owners of Intel and AMD CPUs: and so would it be
unacceptable to distros to have a separate kernel package for it.

A distro is unable to hold the tide of creeping bloat - they dont
have the long-term view (and probably shouldnt have it - a distro
should care about and maximize the here-and-now utility of Linux
mostly). Upstream maintainers are the ones to say "NO" to such crap,
even if it's unpopular. In this current case that's me i guess.

Note what _is_ acceptable and what _is_ doable is to be a bit more
inventive when dumping this optional, currently-high-overhead
paravirt feature on us. My message to Xen folks is: use dynamic
patching, fix your hypervisor and just use plain old-fashioned
_restraint_ and common sense when engineering things, and for
heaven's sake, _care_ about the native kernel's performance because
in the long run it's your bread and butter too.

1% overhead (paid by _everyone_ who runs that distro kernel) for
something 95%+ of the users _wont ever use_ is _not acceptable_.

So unless this overhead is brought down significantly i see the dom0
patches as outright harmful: upstream dom0 makes PARAVIRT=Y even
_harder_ for distros to disable, and there will be even _less_
incentive for Xen folks to get the native bloat they caused under
control.

Note that specific pieces of the dom0 patches, like the io-apic
driver patches, are still acceptable of course, as they improve the
overall quality of the kernel. Helping dom0 too as a 'side effect'
is fully acceptable as long as everyone else is helped too.

And if the performance problems are largely fixed (say 0.1%-0.2%
overhead in the benchmarks i did would be acceptable IMO), and if
the patches are all squeaky-clean, i suspect we can do upstream dom0
as well.

Ingo

Chris Mason

Jun 2, 2009, 10:30:26 AM

I think all of the distros expect mainline to do a difficult balancing
act. On the one hand we don't want any performance regressions, and on
the other hand, we need to be able to work with mainline to get the
major features that we're shipping and using every day in.

What we have here is a CONFIG_ that would be heavily used if it were
included. It may be used as part of a separate kernel build, or it may
be used for a single kernel image. Either way, this is something for
the distros to decide. You're arguing to leave it out of mainline
because making a separate kernel package for it is too hard, but by
doing so you are forcing them to maintain the patches out of tree. They still
have the same decision about the second package either way.

The xen developers are not dropping this code on mainline and running
away to spend the rest of their lives on the beach. This is a step
in long term development, and keeping it out of the kernel is making
testing and continued development much harder on the users.

I'm not suggesting we should take broken code, or that we should lower
standards just for xen. But, expecting the xen developers to fix the 1%
hit on a very specific micro-benchmark is not a way to promote new
projects for the kernel, and it isn't a good way to convince people to
do continued development in mainline instead of in private trees.

Please reconsider. Keeping these patches out is only making it harder
on the people that want to make them better.

-chris

Ulrich Drepper

Jun 2, 2009, 11:00:23 AM

On Tue, Jun 2, 2009 at 7:18 AM, Chris Mason <chris...@oracle.com> wrote:
> I'm not suggesting we should take broken code, or that we should lower
> standards just for xen. But, expecting the xen developers to fix the 1%
> hit on a very specific micro-benchmark is not a way to promote new
> projects for the kernel, and it isn't a good way to convince people to
> do continued development in mainline instead of in private trees.

It's not a new project which needs to be treated with kid gloves.
And one can be sure that once the code is in the tree those interested
parties will not be as strongly motivated to fix any problem like
this. Ingo pointed to a way which doesn't negatively impact the
performance of the Xen kernel and reduces the overhead (dynamic
patching). Just get started on this (and general cleanup) and this
whole argument will go away.

I find it ridiculous to use the "but it's used" argument to try to
force the code into the kernel. By this argument you can say the same
about crap like ndiswrapper and similarly harmful code.

Chris Mason

Jun 2, 2009, 11:20:09 AM

On Tue, Jun 02, 2009 at 07:49:32AM -0700, Ulrich Drepper wrote:
> On Tue, Jun 2, 2009 at 7:18 AM, Chris Mason <chris...@oracle.com> wrote:
> > I'm not suggesting we should take broken code, or that we should lower
> > standards just for xen. But, expecting the xen developers to fix the 1%
> > hit on a very specific micro-benchmark is not a way to promote new
> > projects for the kernel, and it isn't a good way to convince people to
> > do continued development in mainline instead of in private trees.
>
> It's not a new project which needs to be treated with kid gloves.

Sure, I'm not suggesting we send them flowers on Mother's Day or
anything, and I'm not suggesting they skip out on important cleanups.
But, I strongly object to a 1% hit on a random micro benchmark as a
reason to keep the code out.

> And one can be sure that once the code is in the tree those interested
> parties will not be as strongly motivated to fix any problem like
> this.

The idea that people shipping xen aren't interested in performance
regressions is really strange to me.

> Ingo pointed to a way which doesn't negatively impact the
> performance of the Xen kernel and reduces the overhead (dynamic
> patching). Just get started on this (and general cleanup) and this
> whole argument will go away.

Dynamic patching is a big wad of duct tape over the problem.

>
> I find it ridiculous to use the "but it's used" argument to try to
> force the code into the kernel. By this argument you can say the same
> about crap like ndiswrapper and similarly harmful code.

I'm not saying to take harmful code, I'm saying to take code with a
small performance regression under a specific CONFIG_. Slub regresses
more than 1% on database loads, CONFIG_SCHED_GROUPS, the list goes on
and on.

The idea that we should take code that is heavily used is important.
The best place to fix xen is in the kernel. It always has been, and
keeping it out is just making it harder on everyone involved.

-chris

Ulrich Drepper

Jun 2, 2009, 11:30:25 AM

On Tue, Jun 2, 2009 at 8:03 AM, Chris Mason <chris...@oracle.com> wrote:
> The idea that people shipping xen aren't interested in performance
> regressions is really strange to me.

Why? They have a different base line. For them any regression to
bare hardware performance is even a positive (since it means the gap
between hardware and virt shrinks).

> Dynamic patching is a big wad of duct tape over the problem.

And what do you call the Xen model? It's a perfect fit IMO.

> I'm not saying to take harmful code, I'm saying to take code with a
> small performance regression under a specific CONFIG_. Slub regresses
> more than 1% on database loads, CONFIG_SCHED_GROUPS, the list goes on
> and on.

None of those have to be enabled in default kernels.

> The best place to fix xen is in the kernel.

No. The best way to fix things is _on the way into the kernel_.

Chris Mason

Jun 2, 2009, 12:30:23 PM

On Tue, Jun 02, 2009 at 08:22:57AM -0700, Ulrich Drepper wrote:
> On Tue, Jun 2, 2009 at 8:03 AM, Chris Mason <chris...@oracle.com> wrote:
> > The idea that people shipping xen aren't interested in performance
> > regressions is really strange to me.
>
> Why? They have a different base line. For them any regression to
> bare hardware performance is even a positive (since it means the gap
> between hardware and virt shrinks).

And we would have gotten away with it too if it weren't for you meddling
kids!

>
>
> > Dynamic patching is a big wad of duct tape over the problem.
>
> And what do you call the Xen model? It's a perfect fit IMO.
>
> > I'm not saying to take harmful code, I'm saying to take code with a
> > small performance regression under a specific CONFIG_. Slub regresses
> > more than 1% on database loads, CONFIG_SCHED_GROUPS, the list goes on
> > and on.
>
> None of those have to be enabled in default kernels.
>
>
> > The best place to fix xen is in the kernel.
>
> No. The best way to fix things is _on the way into the kernel_.

It all depends on which parts are causing problems. A 1% performance
hit, under a CONFIG_ that can be disabled? If maintainers are focusing
on details like this for long term and active projects, we're doing
something very wrong.

-chris

Pekka Enberg

Jun 2, 2009, 2:10:13 PM

Hi Chris,

On Tue, Jun 2, 2009 at 6:03 PM, Chris Mason <chris...@oracle.com> wrote:
>> I find it ridiculous to use the "but it's used" argument to try to
>> force the code into the kernel. �By this argument you can say the same
>> about crap like ndiswrapper and similarly harmful code.
>
> I'm not saying to take harmful code, I'm saying to take code with a
> small performance regression under a specific CONFIG_. Slub regresses
> more than 1% on database loads, CONFIG_SCHED_GROUPS, the list goes on
> and on.

Maybe it's just me but you make it sound like the SLUB regression is
okay. It's not.

Unfortunately we're now in a position where we can't just remove SLUB
(it's an improvement over SLAB for NUMA) so we're stuck with two
allocators with a third one on its way to the kernel. So yes, it makes a
lot of sense to me to fix the CONFIG_PARAVIRT regression before merging
more of the xen stuff in the kernel. It's always easier to fix these
things before they hit the kernel and people start to depend on them.

Pekka

Pekka Enberg

Jun 2, 2009, 2:20:10 PM

Hi Chris,

On Tue, Jun 2, 2009 at 8:03 AM, Chris Mason <chris...@oracle.com> wrote:
>>> The best place to fix xen is in the kernel.

On Tue, Jun 02, 2009 at 08:22:57AM -0700, Ulrich Drepper wrote:
>> No. The best way to fix things is _on the way into the kernel_.

On Tue, Jun 2, 2009 at 7:20 PM, Chris Mason <chris...@oracle.com> wrote:
> It all depends on which parts are causing problems. A 1% performance
> hit, under a CONFIG_ that can be disabled? If maintainers are focusing
> on details like this for long term and active projects, we're doing
> something very wrong.

The fact that CONFIG_PARAVIRT can be disabled doesn't really help. As
a matter of fact, I'd argue that one of the primary reasons the
CONFIG_SLUB regression is still there is because people can just
disable it and use CONFIG_SLAB instead.

So I think we have some evidence to suggest that people have less
incentive to fix things once something is merged to the kernel. And I
don't mean the authors of the code here but basically _everyone_
involved in kernel development. It usually takes effort from a variety
of people to get everything ironed out because, let's face it, we can't
expect a handful of people to test out every configuration let alone
fix them.

Pekka

Chris Mason

Jun 2, 2009, 2:40:04 PM

On Tue, Jun 02, 2009 at 09:06:41PM +0300, Pekka Enberg wrote:
> Hi Chris,
>
> On Tue, Jun 2, 2009 at 6:03 PM, Chris Mason <chris...@oracle.com> wrote:
> >> I find it ridiculous to use the "but it's used" argument to try to
> >> force the code into the kernel. By this argument you can say the same
> >> about crap like ndiswrapper and similarly harmful code.
> >
> > I'm not saying to take harmful code, I'm saying to take code with a
> > small performance regression under a specific CONFIG_. Slub regresses
> > more than 1% on database loads, CONFIG_SCHED_GROUPS, the list goes on
> > and on.
>
> Maybe it's just me but you make it sound like the SLUB regression is
> okay. It's not.

Well, it is and it isn't. SLUB was implemented with specific workloads
in mind. I'd prefer that regressions not get in at all, but sometimes
it takes the broad exposure you get from being in mainline to
finish things. Sometimes we finish things with rm, but without slub I
don't think the issues it was trying to solve would have been discussed
at all.

>
> Unfortunately we're now in a position where we can't just remove SLUB
> (it's an improvement over SLAB for NUMA) so we're stuck with two
> allocators with a third one on its way to the kernel. So yes, it makes a
> lot of sense to me to fix the CONFIG_PARAVIRT regression before merging
> more of the xen stuff in the kernel. It's always easier to fix these
> things before they hit the kernel and people start to depend on them.

The problem is that people already depend on them ;) If people want to
nack xen based on code structure, that's more than fair, I just hope we
can keep the discussion around something the xen developers can work
toward.

Micro benchmarks come and go, we tune as best we can based on the
tradeoffs at hand.

-chris

Thomas Gleixner

Jun 2, 2009, 3:30:11 PM

On Tue, 2 Jun 2009, Chris Mason wrote:
> I'm not suggesting we should take broken code, or that we should lower
> standards just for xen. But, expecting the xen developers to fix the 1%
> hit on a very specific micro-benchmark is not a way to promote new
> projects for the kernel, and it isn't a good way to convince people to
> do continued development in mainline instead of in private trees.
>
> Please reconsider. Keeping these patches out is only making it harder
> on the people that want to make them better.

You are missing one subtle point.

I read several times, that A, B and C can not be changed design wise
to allow newer kernels to run on older hypervisors. That's what
frightens me:

dom0 imposes a kind of ABI which we can not change anymore.

So where is the room for the improvements which you expect when dom0
is merged ? It's not about micro benchmark results, it's about the
inability to fix the existing design decisions in the near future.

You can change the internals of btrfs as often as you want, but you
can not change the on disk layout at will. And while you can invent
btrfs2 w/o any impact aside of grumpy users and a couple of thousand
lines self contained code, dom0v2 would just add a different layer of
intrusiveness into the x86 code base w/o removing the existing one.

Thanks,

tglx

Chris Mason

Jun 2, 2009, 4:00:16 PM

On Tue, Jun 02, 2009 at 09:14:23PM +0200, Thomas Gleixner wrote:
> On Tue, 2 Jun 2009, Chris Mason wrote:
> > I'm not suggesting we should take broken code, or that we should lower
> > standards just for xen. But, expecting the xen developers to fix the 1%
> > hit on a very specific micro-benchmark is not a way to promote new
> > projects for the kernel, and it isn't a good way to convince people to
> > do continued development in mainline instead of in private trees.
> >
> > Please reconsider. Keeping these patches out is only making it harder
> > on the people that want to make them better.
>
> You are missing one subtle point.
>

I'm sure I'm missing many more than one ;)

> I read several times, that A, B and C can not be changed design wise
> to allow newer kernels to run on older hypervisors. That's what
> frightens me:
>
> dom0 imposes a kind of ABI which we can not change anymore.
>
> So where is the room for the improvements which you expect when dom0
> is merged ? It's not about micro benchmark results, it's about the
> inability to fix the existing design decisions in the near future.
>
> You can change the internals of btrfs as often as you want, but you
> can not change the on disk layout at will. And while you can invent
> btrfs2 w/o any impact aside of grumpy users and a couple of thousand
> lines self contained code, dom0v2 would just add a different layer of
> intrusiveness into the x86 code base w/o removing the existing one.

Well, if there's a line we want to draw in the sand based on firm and
debatable criteria, great.

The problem I see here is that our line in the sand for the xen
developers is fuzzy and winding (yeah, I saw Linus' reply in the other
thread, full ack on that).

-chris

Jeremy Fitzhardinge

Jun 3, 2009, 2:40:12 AM

Ulrich Drepper wrote:
> Ingo pointed to a way which doesn't negatively impact the
> performance of the Xen kernel and reduces the overhead (dynamic
> patching)

The pvops code is already fully dynamically patched, which replaces all
the indirect calls with either direct calls, inline instructions or
nops. It has been this way from the initial implementation.

More recently I changed the calling convention on some of the most
common critical-path ops to reduce the register pressure caused by the
function call clobbers; you just don't need a pile of registers to
disable interrupts.

Ingo knows all this, so I'm not sure what further patching he's
suggesting. I don't see any more likely candidates, but I'm open to
suggestions.

J

Rusty Russell

Jun 3, 2009, 8:40:18 AM

On Sat, 30 May 2009 07:53:30 pm Ingo Molnar wrote:
> Not so with CONFIG_PARAVIRT. That feature is almost fully parasitic
> to native environments: currently it brings no advantage on native
> hardware _at all_ (and 95% of the users don't care about Xen).

And VMI, and of course lguest. And a little bit of KVM, though not the paths
you're talking about.

Your complaints are a little unfocussed: anything Xen could do to make this
overhead go away, we would be able to do ourselves. Yet it's not clear where
this 1% is going. We're more aggressive with our patching right now than any
other subsystem; which call is the problem?

But your entire rant is willfully ignorant; if you've only just discovered
that all commonly enabled config options have a cost when unused, I am shocked.

I took my standard config, and turned on AUDIT, CGROUP, all the sched options,
all the namespace options, profiling, markers, kprobes, relocatable kernel,
1000Hz, preempt, support for every x86 variant (ie. PAE, NUMA, HIGHMEM64,
DISCONTIGMEM). I turned off kernel debugging and paravirt. Booted with
maxcpus=1.

I created another with SMP=n, and all that turned off. No highmem. 100Hz.

Then I ran virtbench which I had lying around, in local mode (ie. between
processes running locally, rather than between guest OSes)

Distro-style maximal config:
Time for one context switch via pipe: 2285 (2276 - 2290)
Time for one Copy-on-Write fault: 3415 (3264 - 4266)
Time to exec client once: 234656 (232906 - 253343)
Time for one fork/exit/wait: 82656 (82031 - 83218)
Time for gettimeofday(): 253 (253 - 254)
Time to send 4 MB from host: 6911750 (6901000 - 6925500)
Time for one int-0x80 syscall: 284 (284 - 369)
Time for one syscall via libc: 139 (139 - 140)
Time to walk linear 64 MB: 760375 (754750 - 868125)
Time to walk random 64 MB: 955500 (947250 - 990000)
Time for two PTE updates: 3173 (3143 - 3196)
Time to read from disk (256 kB): 2395000 (2319000 - 2434500)
Time for one disk read: 114156 (112906 - 114562)
Time to send 4 MB between guests: 7639000 (7568250 - 7739750)
Time for inter-guest pingpong: 13900 (13800 - 13931)
Time to sendfile 4 MB between guests: 7187000 (7129000 - 46349000)
Time to receive 1000 1k UDPs between guests: 6576000 (6500000 - 7232000)

Custom-style minimal config:
Time for one context switch via pipe: 1351 (1333 - 1405)
Time for one Copy-on-Write fault: 2754 (2610 - 3586)
Time to exec client once: 190625 (189812 - 207500)
Time for one fork/exit/wait: 60968 (60875 - 61218)
Time for gettimeofday(): 248 (248 - 249)
Time to send 4 MB from host: 6643250 (6583750 - 6880750)
Time for one int-0x80 syscall: 280 (280 - 334)
Time for one syscall via libc: 133 (133 - 144)
Time to walk linear 64 MB: 758750 (752375 - 835000)
Time to walk random 64 MB: 943500 (934500 - 1084000)
Time for two PTE updates: 1917 (1900 - 2401)
Time to read from disk (256 kB): 2390500 (2309000 - 2536000)
Time for one disk read: 113250 (112937 - 113875)
Time to send 4 MB between guests: 7830500 (7740500 - 7946000)
Time for inter-guest pingpong: 12566 (11621 - 13652)
Time to sendfile 4 MB between guests: 6533000 (5961000 - 76365000)
Time to receive 1000 1k UDPs between guests: 5278000 (5194000 - 5431000)

Average slowdown: 15%
Worst: context switch, 69% slowdown.
Best: 4MB inter-process. 2% speedup (maybe due to more mem, 4G machine)

So in fact CONFIG_PARAVIRT's 1% makes it a paragon.

We should be praising Jeremy for his efforts and asking him to look at some of
these others!

Sorry for the facts,
Rusty.
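
The per-test slowdowns behind the summary can be recomputed from the two listings above. The sketch below is one way to do it, assuming "slowdown" means the maximal-config time divided by the minimal-config time, minus one; the mail does not say how its average was formed, so only the per-row figure and the worst row are computed. The struct and function names are made up for the example, and only four of the rows are copied in.

```c
#include <assert.h>
#include <stddef.h>

struct bench_pair {
	const char *name;
	double maximal;		/* distro-style maximal config */
	double minimal;		/* custom-style minimal config */
};

/* Slowdown of 'maximal' relative to 'minimal', in percent. */
double slowdown_pct(double maximal, double minimal)
{
	assert(minimal > 0.0);
	return (maximal / minimal - 1.0) * 100.0;
}

/* Index of the row with the largest slowdown. */
size_t worst_index(const struct bench_pair *b, size_t n)
{
	size_t i, worst = 0;

	assert(n > 0);
	for (i = 1; i < n; i++)
		if (slowdown_pct(b[i].maximal, b[i].minimal) >
		    slowdown_pct(b[worst].maximal, b[worst].minimal))
			worst = i;
	return worst;
}

/* A few rows copied from the two result listings. */
const struct bench_pair virtbench_rows[] = {
	{ "context switch via pipe",  2285.0,    1351.0 },
	{ "fork/exit/wait",           82656.0,   60968.0 },
	{ "two PTE updates",          3173.0,    1917.0 },
	{ "send 4 MB between guests", 7639000.0, 7830500.0 },
};
```

With these rows, the context-switch entry comes out as the worst case at roughly 69%, and the 4 MB inter-guest send comes out slightly negative, i.e. a small speedup, which matches the "Worst" and "Best" lines of the summary.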

Linus Torvalds

Jun 3, 2009, 12:20:07 PM

On Wed, 3 Jun 2009, Rusty Russell wrote:
>
> I took my standard config, and turned on AUDIT, CGROUP, all the sched options,
> all the namespace options, profiling, markers, kprobes, relocatable kernel,
> 1000Hz, preempt, support for every x86 variant (ie. PAE, NUMA, HIGHMEM64,
> DISCONTIGMEM). I turned off kernel debugging and paravirt. Booted with
> maxcpus=1.

Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
can't compare it to a no-highmem case).

It's one of those options that we do to support crazy hardware, and it is
EXTREMELY expensive (but mainly only if you actually have the hardware, ie
you actually have more than 1GB of RAM for HIGHMEM4G - HIGHMEM64G is
always expensive for forks, but nobody sane ever enables it).

IOW, it's not at all comparable to the other options. It's not a software
option, it's a real hardware option that hits you not depending on whether
you want some sw capability, but on whether you want to use memory.

Because depending on the CPU, some loads will have 25% of time spent in
just kmap/kunmap due to TLB flushes. Yes, really. There's a reason 32-bit
kernels are shit for 1GB+ memory.

After you've turned off HIGHMEM (or run on a sane architecture like x86-64
that doesn't need it), re-run the benchmark, because it's interesting. But
with HIGHMEM being different, your benchmark is totally invalid and
pointless.

Linus

Rusty Russell

Jun 4, 2009, 3:00:24 AM

(Re-send for lkml. For some reason kmail decided HTML mail was cool again).

On Thu, 4 Jun 2009 01:39:38 am Linus Torvalds wrote:
> On Wed, 3 Jun 2009, Rusty Russell wrote:
> > I took my standard config, and turned on AUDIT, CGROUP, all the sched
> > options, all the namespace options, profiling, markers, kprobes,
> > relocatable kernel, 1000Hz, preempt, support for every x86 variant (ie.
> > PAE, NUMA, HIGHMEM64, DISCONTIGMEM). I turned off kernel debugging and
> > paravirt. Booted with maxcpus=1.
>
> Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
> can't compare it to a no-highmem case).

Thanks, your point is demonstrated below. I don't think HIGHMEM4G is
unreasonable for a distro tho, so I turned that on instead.

Anyway, so everyone can play along at home, I've extracted the context-switch
test into a standalone proggie; run it in a loop to make it warm and watch the
results. I tested it here on my 3G RAM Core2 Duo laptop (travelling, away
from previous test box), and here are typical results (./context-switch 1000):

minimal config:
0.001271
0.001279
0.001284
0.001279
0.001290

maximal config (as before, but NUMAQ=n, HIGHMEM4G, maxcpus=1):
0.002476
0.002507
0.002491
0.002518
0.002505

maximum config (as above with mem=880M ie. no actual highmem)
0.001917
0.001893
0.001936
0.001915
0.001925

So we're paying a 48% overhead; microbenchmarks always suffer as code is added,
and we've added a lot of code with these options.

I've attached my maximal and minimal configs if people are overly curious.
Rusty.

#include <sys/types.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <err.h>
#include <sys/time.h>
#include <stdio.h>

/*
 * Bounce one byte between parent and child over a pair of pipes.
 * Each loop iteration is one round trip (two context switches), which
 * is why 'runs' drops by 2 per iteration.
 */
static void do_context_switch(int runs,
			      struct timeval *start, struct timeval *end)
{
	char c = 1;
	int fds1[2], fds2[2], child;

	if (pipe(fds1) != 0 || pipe(fds2) != 0)
		err(1, "Creating pipe");

	child = fork();
	if (child == -1)
		err(1, "forking");

	if (child > 0) {
		/* Parent: writes to fds1, reads the reply from fds2. */
		close(fds1[0]);
		close(fds2[1]);

		gettimeofday(start, NULL);
		while (runs > 0) {
			if (write(fds1[1], &c, 1) != 1)
				err(1, "write");
			if (read(fds2[0], &c, 1) != 1)
				err(1, "read");
			runs -= 2;
		}
		gettimeofday(end, NULL);

		waitpid(child, NULL, 0);
		close(fds1[1]);
		close(fds2[0]);
	} else {
		/* Child: echoes each byte back to the parent. */
		close(fds2[0]);
		close(fds1[1]);

		while (runs > 0) {
			if (read(fds1[0], &c, 1) != 1)
				err(1, "read");
			if (write(fds2[1], &c, 1) != 1)
				err(1, "write");
			runs -= 2;
		}
		exit(0);
	}
}

int main(int argc, char *argv[])
{
	struct timeval start, end, diff;

	if (argc != 2)
		errx(1, "Usage: context-switch <num-runs>");

	do_context_switch(atoi(argv[1]), &start, &end);

	timersub(&end, &start, &diff);

	/* tv_sec and tv_usec are signed types; cast to match the format. */
	printf("%ld.%06ld\n", (long)diff.tv_sec, (long)diff.tv_usec);
	return 0;
}

minimal-config

maximal-config
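
The program prints total elapsed seconds, so the figures quoted in the mail still need dividing down to get a per-switch cost. The helper below is a small sketch of that step, assuming the `<num-runs>` argument approximates the number of context switches timed (the loop subtracts 2 per round trip, and a round trip is two switches). The function and array names are made up for the example; the sample values are the ones listed in the mail for `./context-switch 1000`.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples, in seconds. */
double mean_seconds(const double *samples, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += samples[i];
	return sum / (double)n;
}

/* Elapsed time per switch in nanoseconds, given total seconds and runs. */
double per_switch_ns(double total_seconds, int runs)
{
	assert(runs > 0);
	return total_seconds * 1e9 / (double)runs;
}

/* Timings quoted in the mail, all for ./context-switch 1000. */
const double minimal_cfg[] = {
	0.001271, 0.001279, 0.001284, 0.001279, 0.001290
};
const double maximal_highmem[] = {
	0.002476, 0.002507, 0.002491, 0.002518, 0.002505
};
const double maximal_880m[] = {
	0.001917, 0.001893, 0.001936, 0.001915, 0.001925
};
```

Note that each "switch" measured this way also includes the pipe read and write system calls, so it is an upper bound on the pure scheduler cost. Averaging the five samples per config, the mem=880M maximal config comes out at about one and a half times the minimal config, in the same range as the overhead figure given in the mail.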

Linus Torvalds

Jun 4, 2009, 11:10:13 AM

On Thu, 4 Jun 2009, Rusty Russell wrote:
> >
> > Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
> > can't compare it to a no-highmem case).
>
> Thanks, your point is demonstrated below. I don't think HIGHMEM4G is
> unreasonable for a distro tho, so I turned that on instead.

Well, I agree that HIGHMEM4G is a _reasonable_ thing to turn on.

The thing I disagree with is that it's at all valid to then compare to
some all-software feature thing. HIGHMEM doesn't expand any esoteric
capability that some people might use - it's about regular RAM for regular
users.

And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
hated having to merge it, and I still hate it. It's a stupid, ugly, and
very invasive config option. It's just that it's there to support a
stupid, ugly and very annoying fundamental hardware problem.

So I think your minimum and maximum configs should at least _match_ in
HIGHMEM. Limiting memory to not actually having any (with "mem=880M") will
avoid the TLB flushing impact of HIGHMEM, which is clearly going to be the
_bulk_ of the overhead, but HIGHMEM is still going to be noticeable on at
least some microbenchmarks.

In other words, it's a lot like CONFIG_SMP, but at least CONFIG_SMP has a
damn better reason for existing today than CONFIG_HIGHMEM.

That said, I suspect that now your context-switch test is likely no longer
dominated by that thing, so looking at your numbers:

> minimal config: ~0.001280
> maximal config: ~0.002500 (with actual high mem)
> maximum config: ~0.001925 (with mem=880M)

and I think that change from 0.001280 - 0.001925 (rough averages by
eye-balling it, I didn't actually calculate anything) is still quite
interesting, but I do wonder how much of it ends up being due to just code
generation issues for CONFIG_HIGHMEM and CONFIG_SMP.

> So we're paying a 48% overhead; microbenchmarks always suffer as code is added,
> and we've added a lot of code with these options.

I do agree that microbenchmarks are interesting, and tend to show these
kinds of things clearly. It's just that when you look at the scheduler,
for example, something like SMP support is a _big_ issue, and even if we
get rid of the worst synchronization overhead with "maxcpus=1" at least
removing the "lock" prefixes, I'm not sure how relevant it is to say that
the scheduler is slower with SMP support.

(The same way I don't think it's relevant or interesting to see that it's
slower with HIGHMEM).

They are simply such fundamental features that the two aren't comparable.
Why would anybody compare a UP scheduler with a SMP scheduler? It's simply
not the same problem. What does it mean to say that one is 48% slower?
That's like saying that a squirrel is 48% juicier than an orange - maybe
it's true, but anybody who puts the two in a blender to compare them is
kind of sick. The comparison is ugly and pointless.

Now, other feature comparisons are way more interesting. For example, if
statistics gathering is a noticeable portion of the 48%, then that really
is a very relevant comparison, since scheduler statistics is something
that is in no way "fundamental" to the hardware base, and most people
won't care.

So comparing a "scheduler statistics" overhead vs "minimal config"
overhead is very clearly a sane thing to do. Now we're talking about a
feature that most people - even if it was somehow hardware related -
wouldn't use or care about.

IOW, even if it were to use hardware features (say, something like
oprofile, which is at least partly very much about exposing actual
physical features of the hardware), if it's not fundamental to the whole
usage for a huge percentage of people, then it's an "optional feature", and
seeing slowdown is a big deal.

Something like CONFIG_HIGHMEM* or CONFIG_SMP is not really what I'd ever
call "optional feature", although I hope to Dog that CONFIG_HIGHMEM can
some day be considered that.

Dave McCracken

Jun 4, 2009, 6:00:12 PM

On Thursday 04 June 2009, Linus Torvalds wrote:
> On Thu, 4 Jun 2009, Rusty Russell wrote:
> > > Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
> > > can't compare it to a no-highmem case).
> >
> > Thanks, your point is demonstrated below. I don't think HIGHMEM4G is
> > unreasonable for a distro tho, so I turned that on instead.
>
> Well, I agree that HIGHMEM4G is a reasonable thing to turn on.
>
> The thing I disagree with is that it's at all valid to then compare to
> some all-software feature thing. HIGHMEM doesn't expand any esoteric
> capability that some people might use - it's about regular RAM for regular
> users.

I think you're missing the point of Rusty's benchmark. I see his exercise as
"compare a kernel configured as a distro would vs a custom-built kernel
configured for the exact target environment". In that light, questions about
the CONFIG options Rusty used should be based on whether most distros would
use them in their stock kernels as opposed to how necessary they are.

What I see as the message of his benchmark is if you care about performance
you should be customizing your kernel anyway. Distro kernels are slow. An
option that makes the distro kernel a bit slower is no big deal since anyone
who wants speed should already be rebuilding their kernel.

Don't get me wrong. I think it's always a good idea to minimize any
performance penalty, even under specific configurations. I just think
criticizing it because distros might enable it is a poor argument.

Dave McCracken
Oracle Corp.

Rusty Russell

Jun 5, 2009, 12:50:06 AM

On Fri, 5 Jun 2009 12:32:14 am Linus Torvalds wrote:
> So I think your minimum and maximum configs should at least _match_ in
> HIGHMEM. Limiting memory to not actually having any (with "mem=880M") will
> avoid the TLB flushing impact of HIGHMEM, which is clearly going to be the
> _bulk_ of the overhead, but HIGHMEM is still going to be noticeable on at
> least some microbenchmarks.

Well, Ingo was ranting because (paraphrase) "no other config option when
*unused* has as much impact as CONFIG_PARAVIRT!!!!!!!!!!".

That was the point of my mail; facts show it's simply untrue.

> The comparison is ugly and pointless.

(Re: SMP)

Distributions don't ship UP kernels any more; this shows what that costs if
you're actually on a UP box. If we really don't care, perhaps we should make
CONFIG_SMP=n an option under EMBEDDED for x86. And we can rip out the complex
SMP patching stuff too.

> Something like CONFIG_HIGHMEM* or CONFIG_SMP is not really what I'd ever
> call "optional feature", although I hope to Dog that CONFIG_HIGHMEM can
> some day be considered that.

Someone from a distro might know how many deployed machines don't need them.
Kernel hackers tend to have modern machines; same with "enterprise" sites. I
have no idea.

Without those facts, I'll leave further discussion to someone else :)

Thanks,
Rusty.

Gerd Hoffmann

Jun 5, 2009, 3:40:09 AM

Hi,

> I think you're missing the point of Rusty's benchmark. I see his exercise as
> "compare a kernel configured as a distro would vs a custom-built kernel
> configured for the exact target environment". In that light, questions about
> the CONFIG options Rusty used should be based on whether most distros would
> use them in their stock kernels as opposed to how necessary they are.

Well. The test ran on a machine with so much memory that you need
HIGHMEM to use it all. I think it also was SMP. So a custom kernel for
*that* machine would certainly include SMP and HIGHMEM ...

> What I see as the message of his benchmark is if you care about performance
> you should be customizing your kernel anyway.

Sure. That wouldn't include turning off HIGHMEM and SMP though because
you need them to make full use of your hardware. While it might be
interesting by itself to see what the overhead of these config options
is, it is IMHO quite pointless *in the context of this discussion*.

All the other options (namespaces, audit, statistics, whatnot) are
different: You check whether you want that $feature, if not you can
turn it off. Distros tend to have them all turned on. So looking at
the overhead of these config options when enabled + unused (and compare
to the paravirt overhead) is certainly a valid thing.

cheers,
Gerd

Rusty Russell

Jun 5, 2009, 10:40:10 AM

On Fri, 5 Jun 2009 05:01:25 pm Gerd Hoffmann wrote:
> Hi,
>
> > I think you're missing the point of Rusty's benchmark. I see his
> > exercise as "compare a kernel configured as a distro would vs a
> > custom-built kernel configured for the exact target environment". In
> > that light, questions about the CONFIG options Rusty used should be based
> > on whether most distros would use them in their stock kernels as opposed
> > to how necessary they are.
>
> Well. The test ran on a machine with so much memory that you need
> HIGHMEM to use it all. I think it also was SMP. So a custom kernel for
> *that* machine would certainly include SMP and HIGHMEM ...

I have a UP machine with 512M of RAM, but I wasn't going to take it out just
to prove the point. Hence I used my test machine with mem=880M maxcpus=1 to
simulate it, but that's a distraction here.

> While it might be
> interesting by itself to see what the overhead of these config options
> is, it is IMHO quite pointless *in the context of this discussion*.

No, you completely missed the point.
Rusty.

Linus Torvalds

Jun 5, 2009, 11:00:14 AM

On Fri, 5 Jun 2009, Rusty Russell wrote:
>
> Distributions don't ship UP kernels any more; this shows what that costs if
> you're actually on a UP box. If we really don't care, perhaps we should make
> CONFIG_SMP=n an option under EMBEDDED for x86. And we can rip out the complex
> SMP patching stuff too.

The complex SMP patching is what makes it _possible_ to not ship UP
kernels any more.

The SMP overhead exists, but it would be even higher if we didn't patch
things to remove the "lock" prefix.

Linus
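
As a toy model of the "lock" prefix patching mentioned above: treat the kernel text as a byte buffer, keep a list of offsets where a lock prefix was emitted, and overwrite each one when running on a single CPU. Everything here is an assumption for illustration, including the function name, the use of a plain offset array, and the choice of a nop as the replacement byte; the real mechanism and its bookkeeping live in the kernel's x86 alternatives code and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define X86_LOCK_PREFIX 0xf0
#define X86_NOP         0x90

/*
 * Toy model (hypothetical helper, not a kernel function).
 * 'text' holds instruction bytes and 'lock_sites' lists the offsets at
 * which a lock prefix was emitted.  For a uniprocessor run every prefix
 * found at a listed offset is replaced by a nop.  Returns the number of
 * sites actually patched.
 */
size_t toy_patch_locks_for_up(uint8_t *text, size_t text_len,
			      const size_t *lock_sites, size_t nsites)
{
	size_t i, patched = 0;

	for (i = 0; i < nsites; i++) {
		size_t off = lock_sites[i];

		assert(off < text_len);
		if (text[off] == X86_LOCK_PREFIX) {
			text[off] = X86_NOP;
			patched++;
		}
	}
	return patched;
}
```

For example, the three bytes `f0 ff 00` (a locked increment of the word at the address in eax) become `90 ff 00`, a nop followed by the same increment without the bus lock. The instruction stream keeps its length, so no branch targets move, which is what makes this kind of in-place patching cheap to apply.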

Anders K. Pedersen

Jun 6, 2009, 3:10:07 PM

Dave McCracken wrote:
> What I see as the message of his benchmark is if you care about performance
> you should be customizing your kernel anyway. Distro kernels are slow. An
> option that makes the distro kernel a bit slower is no big deal since anyone
> who wants speed should already be rebuilding their kernel.

And Oracle of course supports customers doing that?

Not in my experience, and the same goes for most other commercial
enterprise software on Linux as well, so customers have to stick to
distro kernels, if they want support.

Regards,
Anders K. Pedersen

Rusty Russell

Jun 6, 2009, 9:00:15 PM

On Sat, 6 Jun 2009 12:24:43 am Linus Torvalds wrote:
> On Fri, 5 Jun 2009, Rusty Russell wrote:
> > Distributions don't ship UP kernels any more; this shows what that costs
> > if you're actually on a UP box. If we really don't care, perhaps we
> > should make CONFIG_SMP=n an option under EMBEDDED for x86. And we can
> > rip out the complex SMP patching stuff too.
>
> The complex SMP patching is what makes it _possible_ to not ship UP
> kernels any more.

"possible"? You mean "acceptable". Gray, not black and white.

1) Where's the line?
2) Where are we? Does patching claw back 5% of the loss? 50%? 90%?

No point benchmarking on my (SMP) laptop for this one. Gerd cc'd, maybe he
has benchmarks from when he did the work originally?

Thanks,
Rusty.

Linus Torvalds

Jun 8, 2009, 11:00:11 AM

On Sun, 7 Jun 2009, Rusty Russell wrote:
>
> "possible"? You mean "acceptable". Gray, not black and white.

I don't think we can possibly claim to support UP configurations if we
don't patch.

> 1) Where's the line?

"As good as we can make it". There is no line. There's "your code sucks so
badly that it needs to get fixed, or we'll rip it out or disable it".

> 2) Where are we? Does patching claw back 5% of the loss? 50%? 90%?

On some things, especially on P4, the lock overhead was tens of percent.
Just a single locked instruction takes closer to two hundred instructions.

Of course, on those P4's, just kernel entry/exit is pretty high too (even
with sysenter/exit), so I doubt you'll ever see something be 90% just
because of that, unless it causes extra IO or other non-CPU issues.

Linus

Nick Piggin

Jun 9, 2009, 5:40:13 AM

On Thu, Jun 04, 2009 at 08:02:14AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 4 Jun 2009, Rusty Russell wrote:
> > >
> > > Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
> > > can't compare it to a no-highmem case).
> >
> > Thanks, your point is demonstrated below. I don't think HIGHMEM4G is
> > unreasonable for a distro tho, so I turned that on instead.
>
> Well, I agree that HIGHMEM4G is a _reasonable_ thing to turn on.
>
> The thing I disagree with is that it's at all valid to then compare to
> some all-software feature thing. HIGHMEM doesn't expand any esoteric
> capability that some people might use - it's about regular RAM for regular
> users.
>
> And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
> hated having to merge it, and I still hate it. It's a stupid, ugly, and
> very invasive config option. It's just that it's there to support a
> stupid, ugly and very annoying fundamental hardware problem.

I was looking forward to be able to get rid of it... unfortunately
other 32-bit architectures are starting to use it again :(

I guess it is not incredibly intrusive for generic mm code. A bit
of kmap sprinkled around which is actually quite a useful delimiter
of where pagecache is addressed via its kernel mapping.

Do you hate more the x86 code? Maybe that can be removed?

Ingo Molnar

Jun 9, 2009, 7:20:06 AM

* Nick Piggin <npi...@suse.de> wrote:

> On Thu, Jun 04, 2009 at 08:02:14AM -0700, Linus Torvalds wrote:
> >
> >
> > On Thu, 4 Jun 2009, Rusty Russell wrote:
> > > >
> > > > Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
> > > > can't compare it to a no-highmem case).
> > >
> > > Thanks, your point is demonstrated below. I don't think HIGHMEM4G is
> > > unreasonable for a distro tho, so I turned that on instead.
> >
> > Well, I agree that HIGHMEM4G is a _reasonable_ thing to turn on.
> >
> > The thing I disagree with is that it's at all valid to then compare to
> > some all-software feature thing. HIGHMEM doesn't expand any esoteric
> > capability that some people might use - it's about regular RAM for regular
> > users.
> >
> > And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
> > hated having to merge it, and I still hate it. It's a stupid, ugly, and
> > very invasive config option. It's just that it's there to support a
> > stupid, ugly and very annoying fundamental hardware problem.
>
> I was looking forward to be able to get rid of it... unfortunately
> other 32-bit architectures are starting to use it again :(
>
> I guess it is not incredibly intrusive for generic mm code. A bit
> of kmap sprinkled around which is actually quite a useful
> delimiter of where pagecache is addressed via its kernel mapping.
>
> Do you hate more the x86 code? Maybe that can be removed?

IMHO what hurts most about highmem isn't even its direct source code
overhead, but three factors:

- The buddy allocator allocates top down, with highmem pages first.
So a lot of critical apps (the first ones started) will have
highmem footprint, and that shows up every time they use it for
file IO or other ops. kmap() overhead and more.

- Highmem is not really a 'solvable' problem in terms of good VM
balancing. It gives conflicting constraints and there's no single
'good VM' that can really work - just a handful of bad solutions
that differ in their level and area of suckiness.

- The kmap() cache itself can be depleted, and using atomic kmaps
is fragile and error-prone. I think we still have a FIXME of a
possibly triggerable deadlock somewhere in the core MM code ...

OTOH, highmem is clearly a useful hardware enablement feature with a
slowly receding upside and a constant downside. The outcome is
clear: when a critical threshold is reached distros will stop
enabling it. (or more likely, there will be pure 64-bit x86 distros)

Highmem simply enables a sucky piece of hardware so the code itself
has an intrinsic level of suckage, so to speak. There's not much to
be done about it but it's not a _big_ problem either: this type of
hw is moving fast out of the distro attention span.

( What scares/worries me much more than sucky hardware is sucky
_software_ ABIs. Those have a half-life measured not in years but
in decades and they get put into new products stubbornly, again
and again. There's no Moore's Law getting rid of sucky software
really and unlike the present set of sucky highmem hardware
there's no influx of cosmic particles chipping away on their
installed base either. )

Ingo

Nick Piggin

Jun 9, 2009, 8:20:15 AM

Yeah this really sucks about it. OTOH, we have basically the same
thing today with NUMA allocations and task placement.

> - Highmem is not really a 'solvable' problem in terms of good VM
> balancing. It gives conflicting constraints and there's no single
> 'good VM' that can really work - just a handful of bad solutions
> that differ in their level and area of suckiness.

But we have other zones too. And you also run into similar (and
in some senses harder) choices with NUMA as well.

> - The kmap() cache itself can be depleted,

Yeah, the rule is that you're not allowed to do 2 nested ones.

> and using atomic kmaps
> is fragile and error-prone. I think we still have a FIXME of a
> possibly triggerable deadlock somewhere in the core MM code ...

Not that I know of. I fixed the last long standing known one
with the write_begin/write_end changes a year or two ago. It
wasn't exactly related to kmap of the pagecache (but page fault
of the user address in copy_from_user).

> OTOH, highmem is clearly a useful hardware enablement feature with a
> slowly receding upside and a constant downside. The outcome is
> clear: when a critical threshold is reached distros will stop
> enabling it. (or more likely, there will be pure 64-bit x86 distros)

Well now lots of embedded type archs are enabling it... So the
upside is slowly increasing again I think.

> Highmem simply enables a sucky piece of hardware so the code itself
> has an intrinsic level of suckage, so to speak. There's not much to
> be done about it but it's not a _big_ problem either: this type of
> hw is moving fast out of the distro attention span.

Yes but Linus really hated the code. I wonder whether it is
generic code or x86 specific. OTOH with x86 you'd probably
still have to support different page table formats, at least,
so you couldn't rip it all out.

Ingo Molnar

Jun 9, 2009, 8:30:18 AM

* Nick Piggin <npi...@suse.de> wrote:

> > and using atomic kmaps
> > is fragile and error-prone. I think we still have a FIXME of a
> > possibly triggerable deadlock somewhere in the core MM code ...
>
> Not that I know of. I fixed the last long standing known one with
> the write_begin/write_end changes a year or two ago. It wasn't
> exactly related to kmap of the pagecache (but page fault of the
> user address in copy_from_user).
>
> > OTOH, highmem is clearly a useful hardware enablement feature
> > with a slowly receding upside and a constant downside. The
> > outcome is clear: when a critical threshold is reached distros
> > will stop enabling it. (or more likely, there will be pure
> > 64-bit x86 distros)
>
> Well now lots of embedded type archs are enabling it... So the
> upside is slowly increasing again I think.

Sure - but the question is always how often does it show up on lkml?
Less and less. There might be a lot of embedded Linux products sold,
but their users are not reporting bugs to us and are not sending
patches to us in the proportion of their apparent usage.

And on lkml there's a clear downtick in highmem relevance.

> > Highmem simply enables a sucky piece of hardware so the code
> > itself has an intrinsic level of suckage, so to speak. There's
> > not much to be done about it but it's not a _big_ problem
> > either: this type of hw is moving fast out of the distro
> > attention span.
>
> Yes but Linus really hated the code. I wonder whether it is
> generic code or x86 specific. OTOH with x86 you'd probably still
> have to support different page table formats, at least, so you
> couldn't rip it all out.

In practice the pte format hurts the VM more than just highmem. (the
two are inseparably connected of course)

I did this fork overhead measurement some time ago, using
perfcounters and 'perf':

 Performance counter stats for './fork':

      32-bit   32-bit-PAE      64-bit
   ---------   ----------   ---------
   27.367537    30.660090   31.542003   task clock ticks (msecs)

        5785         5810        5751   pagefaults (events)
         389          388         388   context switches (events)
           4            4           4   CPU migrations (events)
   ---------   ----------   ---------
                   +12.0%      +15.2%   overhead

So PAE is 12.0% slower (the overhead of double the pte size and
three page table levels), and 64-bit is 15.2% slower (the extra
overhead of having four page table levels added to the overhead of
double the pte size). [the pagefault count noise is well below the
systematic performance difference.]

Fork is pretty much the worst-case measurement for larger pte
overhead, as it has to copy around a lot of pagetables.

Larger ptes do not come for free and the 64-bit instructions do not
mitigate the cachemiss overhead and memory bandwidth cost.

Ingo
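
A back-of-the-envelope way to see the "double the pte size" cost: for the same mapped range, 8-byte ptes need twice as many last-level page-table pages as 4-byte ptes, and fork has to walk and copy all of them. The sketch below assumes 4 KB pages and a densely mapped, aligned range; the function and macro names are invented for the example and do not correspond to kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u	/* assumed 4 KB pages */

/*
 * Number of last-level page-table pages needed to map 'bytes' of
 * contiguous memory with ptes of 'pte_size' bytes (4 for 32-bit
 * non-PAE, 8 for PAE and 64-bit).
 */
uint64_t pte_pages_needed(uint64_t bytes, unsigned pte_size)
{
	uint64_t pages, ptes_per_page;

	assert(pte_size == 4 || pte_size == 8);
	pages = (bytes + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
	ptes_per_page = TOY_PAGE_SIZE / pte_size;
	return (pages + ptes_per_page - 1) / ptes_per_page;
}

/* Bytes of last-level page table for the same mapping. */
uint64_t pte_bytes_needed(uint64_t bytes, unsigned pte_size)
{
	return pte_pages_needed(bytes, pte_size) * TOY_PAGE_SIZE;
}
```

Mapping 64 MB this way takes 16 page-table pages with 4-byte ptes and 32 with 8-byte ptes. The extra upper levels on PAE and 64-bit add a little more on top, which this sketch ignores.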

Nick Piggin

Jun 9, 2009, 8:50:08 AM

On Tue, Jun 09, 2009 at 02:25:29PM +0200, Ingo Molnar wrote:
>
> * Nick Piggin <npi...@suse.de> wrote:
>
> > > and using atomic kmaps
> > > is fragile and error-prone. I think we still have a FIXME of a
> > > possibly triggerable deadlock somewhere in the core MM code ...
> >
> > Not that I know of. I fixed the last long standing known one with
> > the write_begin/write_end changes a year or two ago. It wasn't
> > exactly related to kmap of the pagecache (but page fault of the
> > user address in copy_from_user).
>
> > > OTOH, highmem is clearly a useful hardware enablement feature
> > > with a slowly receding upside and a constant downside. The
> > > outcome is clear: when a critical threshold is reached distros
> > > will stop enabling it. (or more likely, there will be pure
> > > 64-bit x86 distros)
> >
> > Well now lots of embedded type archs are enabling it... So the
> > upside is slowly increasing again I think.
>
> Sure - but the question is always how often does it show up on lkml?
> Less and less. There might be a lot of embedded Linux products sold,
> but their users are not reporting bugs to us and are not sending
> patches to us in the proportion of their apparent usage.
>
> And on lkml there's a clear downtick in highmem relevance.

Definitely. Probably it works *reasonably* well enough in the
end that embedded systems with reasonable highmem:lowmem ratio
probably will work OK. Sadly for them in a year or two they
probably get the full burden of carrying the crap ;)

No question about that... but you probably can't get rid of that
because somebody will cry about NX bit, won't they?

Avi Kivity

Jun 9, 2009, 9:00:11 AM

Ingo Molnar wrote:
> Fork is pretty much the worst-case measurement for larger pte
> overhead, as it has to copy around a lot of pagetables.
>

We could eliminate that if we use the R/W bit on pgd entries. fork()
would be 256 clear_bit()s (1536 and 768 on i386 pae and nonpae).

copy_one_pte() disagrees though:

	if (unlikely(!pte_present(pte))) {
		if (!pte_file(pte)) {
			swp_entry_t entry = pte_to_swp_entry(pte);

			swap_duplicate(entry);
			/* make sure dst_mm is on swapoff's mmlist. */
			if (unlikely(list_empty(&dst_mm->mmlist))) {
				spin_lock(&mmlist_lock);
				if (list_empty(&dst_mm->mmlist))
					list_add(&dst_mm->mmlist,
						 &src_mm->mmlist);
				spin_unlock(&mmlist_lock);
			}
			if (is_write_migration_entry(entry) &&
					is_cow_mapping(vm_flags)) {
				/*
				 * COW mappings require pages in both parent
				 * and child to be set to read.
				 */
				make_migration_entry_read(&entry);
				pte = swp_entry_to_pte(entry);
				set_pte_at(src_mm, addr, src_pte, pte);
			}
		}
		goto out_set_pte;
	}

Not sure how we can enlaze this thing.

--
error compiling committee.c: too many arguments to function

Linus Torvalds

unread,

Jun 9, 2009, 11:00:06 AM6/9/09

to

On Tue, 9 Jun 2009, Nick Piggin wrote:
> >
> > And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
> > hated having to merge it, and I still hate it. It's a stupid, ugly, and
> > very invasive config option. It's just that it's there to support a
> > stupid, ugly and very annoying fundamental hardware problem.
>
> I was looking forward to be able to get rid of it... unfortunately
> other 32-bit architectures are starting to use it again :(

.. and 32-bit x86 is still not dead, and there are still people who use it
with more than 1G of RAM (ie it's not like it's just purely a "small
embedded cell-phones with Atom" kind of thing that Intel seems to be
pushing for eventually).

> I guess it is not incredibly intrusive for generic mm code. A bit
> of kmap sprinkled around which is actually quite a useful delimiter
> of where pagecache is addressed via its kernel mapping.
>
> Do you hate more the x86 code? Maybe that can be removed?

No, we can't remove the x86 code, and quite frankly, I don't even mind
that. The part I mind is actually the sprinkling of kmap all over. Do a
"git grep kmap fs", and you'll see that there are four times as many
kmap's in filesystem code than there are in mm/.

I was benchmarking btrfs on my little EeePC. There, kmap overhead was 25%
of file access time. Part of it is that people have been taught to use
"kmap_atomic()", which is usable under spinlocks and people have been told
that it's "fast". It's not fast. The whole TLB thing is slow as hell.

Oh well. It's sad. But we can't get rid of it.

Linus

Ingo Molnar

unread,

Jun 9, 2009, 11:00:13 AM6/9/09

to

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> I was benchmarking btrfs on my little EeePC. There, kmap overhead
> was 25% of file access time. Part of it is that people have been
> taught to use "kmap_atomic()", which is usable under spinlocks and
> people have been told that it's "fast". It's not fast. The whole
> TLB thing is slow as hell.

yeah. I noticed it some time ago that INVLPG is unreasonably slow.

My theory is that in the CPU it's perhaps a loop (in microcode?)
over _all_ TLBs - so as TLB caches get larger, INVLPG gets slower
and slower ...

Ingo

Linus Torvalds

unread,

Jun 9, 2009, 11:10:12 AM6/9/09

to

On Tue, 9 Jun 2009, Nick Piggin wrote:

> On Tue, Jun 09, 2009 at 01:17:19PM +0200, Ingo Molnar wrote:
> >
> > - The buddy allocator allocates top down, with highmem pages first.
> > So a lot of critical apps (the first ones started) will have
> > highmem footprint, and that shows up every time they use it for
> > file IO or other ops. kmap() overhead and more.
>
> Yeah this really sucks about it. OTOH, we have basically the same
> thing today with NUMA allocations and task placement.

It's not the buddy allocator. Each zone has its own buddy list.

It's that we do the zones in order, and always start with the HIGHMEM
zone.

Which is quite reasonable for most loads (if the page is only used as a
user mapping, we won't kmap it all that often), but it's bad for things
where we will actually want to touch it over and over again. Notably
filesystem caches that aren't just for user mappings.
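
The zone ordering described above can be illustrated with a small user-space simulation. This is only a sketch: the real allocator also checks watermarks and uses per-node zonelists, and `struct zone_sim` and `alloc_page_sim()` are hypothetical names, not kernel APIs.

```c
#include <assert.h>

enum zone_id { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };

/* Each simulated zone only tracks how many pages it has free. */
struct zone_sim {
    unsigned long free_pages;
};

/* Walk the zones from `highest` downward and take a page from the
 * first zone that has one. Returns the zone used, or -1 if all the
 * permitted zones are empty. */
static int alloc_page_sim(struct zone_sim zones[NR_ZONES], enum zone_id highest)
{
    for (int z = (int)highest; z >= ZONE_DMA; z--) {
        if (zones[z].free_pages > 0) {
            zones[z].free_pages--;
            return z;
        }
    }
    return -1;
}
```

A caller that can cope with highmem pages passes `ZONE_HIGHMEM` and is served from there for as long as that zone has free pages, which is why early allocations tend to land in highmem.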

> > Highmem simply enables a sucky piece of hardware so the code itself
> > has an intrinsic level of suckage, so to speak. There's not much to
> > be done about it but it's not a _big_ problem either: this type of
> > hw is moving fast out of the distro attention span.
>
> Yes but Linus really hated the code. I wonder whether it is
> generic code or x86 specific. OTOH with x86 you'd probably
> still have to support different page table formats, at least,
> so you couldn't rip it all out.

The arch-specific code really isn't that nasty. We have some silly
workarounds for doing 8-byte-at-a-time operations on x86-32 with cmpxchg8b
etc, but those are just odd small details.

If highmem was just a matter of arch details, I wouldn't mind it at all.

It's the generic code pollution I find annoying. It really does pollute a
lot of crap. Not just fs/ and mm/, but even drivers.

Linus

Linus Torvalds

unread,

Jun 9, 2009, 11:30:24 AM6/9/09

to

On Tue, 9 Jun 2009, Ingo Molnar wrote:
>
> In practice the pte format hurts the VM more than just highmem. (the
> two are inseparably connected of course)

I think PAE is a separate issue (ie I think HIGHMEM4G and HIGHMEM64G are
about different issues).

I do think we could probably drop PAE some day - very few 32-bit x86's
have more than 4GB of memory, and the ones that did buy lots of memory
back when it was a big deal for them have hopefully upgraded long since.

Of course, PAE also adds the NX flag etc, so there are probably other
reasons to have it. And quite frankly, PAE is just a small x86-specific
detail that doesn't hurt anybody else.

So I have no reason to really dislike PAE per se - the real dislike is for
HIGHMEM itself, and that gets enabled already for HIGHMEM4G without any
PAE.

Of course, I'd also not ever enable it on any machine I have. PAE does add
overhead, and the NX bit isn't _that_ important to me.

Linus

Nick Piggin

unread,

Jun 9, 2009, 11:40:11 AM6/9/09

to

On Tue, Jun 09, 2009 at 07:54:00AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 9 Jun 2009, Nick Piggin wrote:
> > >
> > > And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
> > > hated having to merge it, and I still hate it. It's a stupid, ugly, and
> > > very invasive config option. It's just that it's there to support a
> > > stupid, ugly and very annoying fundamental hardware problem.
> >
> > I was looking forward to be able to get rid of it... unfortunately
> > other 32-bit architectures are starting to use it again :(
>
> .. and 32-bit x86 is still not dead, and there are still people who use it
> with more than 1G of RAM (ie it's not like it's just purely a "small
> embedded cell-phones with Atom" kind of thing that Intel seems to be
> pushing for eventually).
>
> > I guess it is not incredibly intrusive for generic mm code. A bit
> > of kmap sprinkled around which is actually quite a useful delimiter
> > of where pagecache is addressed via its kernel mapping.
> >
> > Do you hate more the x86 code? Maybe that can be removed?
>
> No, we can't remove the x86 code, and quite frankly, I don't even mind
> that. The part I mind is actually the sprinkling of kmap all over. Do a
> "git grep kmap fs", and you'll see that there are four times as many
> kmap's in filesystem code than there are in mm/.

Yeah, I guess I just don't see it as such a bad thing. As I said,
it's nice to have something to grep for and not have pointers into
pagecache stored around the place (although filesystems do that
with buffercache).

If code has to jump through particular nasty hoops to use atomic
kmaps, that's not such a good thing...

> I was benchmarking btrfs on my little EeePC. There, kmap overhead was 25%
> of file access time. Part of it is that people have been taught to use
> "kmap_atomic()", which is usable under spinlocks and people have been told
> that it's "fast". It's not fast. The whole TLB thing is slow as hell.
>
> Oh well. It's sad. But we can't get rid of it.

If it's such a problem, it could be made a lot faster without too
much problem. You could just introduce a FIFO of ptes behind it
and flush them all in one go. 4K worth of ptes per CPU might
hopefully bring your overhead down to < 1%.
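
A minimal user-space sketch of that FIFO idea, assuming a per-CPU ring of mapping slots that is flushed once per wrap-around instead of once per mapping. `FIFO_SLOTS`, `struct kmap_fifo` and `fifo_map()` are hypothetical names, and the flush is only counted, not performed.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_SLOTS 4    /* hypothetical and tiny, to keep the example short */

struct kmap_fifo {
    const void *page[FIFO_SLOTS];   /* page currently mapped by each slot */
    size_t next;                    /* next slot to hand out */
    unsigned long flushes;          /* simulated TLB flushes so far */
};

/* Map `page` into the next free slot and return the slot index.
 * Stale translations are dropped only when the ring wraps, so the
 * cost is one flush per FIFO_SLOTS mappings. */
static size_t fifo_map(struct kmap_fifo *f, const void *page)
{
    if (f->next == FIFO_SLOTS) {
        f->flushes++;               /* stand-in for one local TLB flush */
        f->next = 0;
    }
    f->page[f->next] = page;
    return f->next++;
}
```

With 4 slots, nine consecutive mappings cost two flushes rather than nine.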

Avi Kivity

unread,

Jun 9, 2009, 12:00:25 PM6/9/09

to

Ingo Molnar wrote:
> * Linus Torvalds <torv...@linux-foundation.org> wrote:
>
>
>> I was benchmarking btrfs on my little EeePC. There, kmap overhead
>> was 25% of file access time. Part of it is that people have been
>> taught to use "kmap_atomic()", which is usable under spinlocks and
>> people have been told that it's "fast". It's not fast. The whole
>> TLB thing is slow as hell.
>>
>
> yeah. I noticed it some time ago that INVLPG is unreasonably slow.
>
> My theory is that in the CPU it's perhaps a loop (in microcode?)
> over _all_ TLBs - so as TLB caches get larger, INVLPG gets slower
> and slower ...
>

The tlb already content-addresses entries when looking up translations,
so it shouldn't be that bad.

invlpg does have to invalidate all the intermediate entries
("paging-structure caches"), and it does (obviously) force a tlb reload.

I seem to recall 50 cycles for invlpg, what do you characterize as
unreasonably slow?

--
error compiling committee.c: too many arguments to function


Linus Torvalds

unread,

Jun 9, 2009, 12:10:11 PM6/9/09

to

On Tue, 9 Jun 2009, Nick Piggin wrote:
>
> If it's such a problem, it could be made a lot faster without too
> much problem. You could just introduce a FIFO of ptes behind it
> and flush them all in one go. 4K worth of ptes per CPU might
> hopefully bring your overhead down to < 1%.

We already have that. The regular kmap() does that. It's just not usable
in atomic context.

We'd need to fix the locking: right now kmap_high() uses non-irq-safe
locks, and it does that whole cross-cpu flushing thing (which is why
those locks _have_ to be non-irq-safe).

The way to fix that, though, would be to never do any cross-cpu calls, and
instead just have a cpumask saying "you need to flush before you do
anything with kmap". So you'd just set that cpumask inside the lock, and
if/when some other CPU does a kmap, they'd flush their local TLB at _that_
point instead of having to have an IPI call.
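
The deferred-flush scheme in the paragraph above can be sketched in user space as follows. It is a simulation under stated assumptions: `NR_CPUS_SIM`, `struct kmap_state`, `kmap_recycle()` and `kmap_enter()` are hypothetical names, the caller is assumed to hold the kmap lock, and the TLB flush is only counted. It does not address what happens to pointers that another CPU uses without calling kmap itself.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SIM 8   /* hypothetical CPU count for the simulation */

struct kmap_state {
    uint32_t need_flush;    /* bit n set: CPU n must flush before its next kmap */
    unsigned long local_flushes[NR_CPUS_SIM];
};

/* Called when kmap slots are recycled: mark every CPU as stale
 * instead of sending cross-CPU IPIs. */
static void kmap_recycle(struct kmap_state *s)
{
    s->need_flush = (1u << NR_CPUS_SIM) - 1;
}

/* Called at the start of a kmap on `cpu`: flush locally only if this
 * CPU was marked stale. Returns 1 if a flush was done, 0 otherwise. */
static int kmap_enter(struct kmap_state *s, unsigned cpu)
{
    assert(cpu < NR_CPUS_SIM);
    uint32_t bit = 1u << cpu;

    if (!(s->need_flush & bit))
        return 0;
    s->need_flush &= ~bit;
    s->local_flushes[cpu]++;    /* stand-in for a local TLB flush */
    return 1;
}
```

CPUs that never call kmap after a recycle never pay for a flush, which is the point of replacing the IPI with a mask.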

If we can get rid of kmap_atomic(), I'd already like HIGHMEM more. Right
now I absolutely _hate_ all the different "levels" of kmap_atomic() and
having to be careful about crazy nesting rules etc.

Linus

Linus Torvalds

unread,

Jun 9, 2009, 12:30:17 PM6/9/09

to

On Tue, 9 Jun 2009, Nick Piggin wrote:
>
> The idea seems nice but isn't the problem that kmap gives back a
> basically 1st class kernel virtual memory? (ie. it can then be used
> by any other CPU at any point without it having to use kmap?).

No, everybody has to use kmap()/kunmap().

The "problem" is that you could in theory run out of kmap frames, since if
everybody does a kmap() in an interruptible context and you have lots and
lots of threads doing different pages, you'd run out. But that has nothing
to do with kmap_atomic(), which is basically limited to just the number of
CPU's and a (very small) level of nesting.

Nick Piggin

unread,

Jun 9, 2009, 12:30:17 PM6/9/09

to

On Tue, Jun 09, 2009 at 09:00:08AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 9 Jun 2009, Nick Piggin wrote:
> >
> > If it's such a problem, it could be made a lot faster without too
> > much problem. You could just introduce a FIFO of ptes behind it
> > and flush them all in one go. 4K worth of ptes per CPU might
> > hopefully bring your overhead down to < 1%.
>
> We already have that. The regular kmap() does that. It's just not usable
> in atomic context.

Well this would be more like the kmap cache idea rather than the
kmap_atomic FIFO (which would remain per-cpu and look much like
the existing kmap_atomic).

> We'd need to fix the locking: right now kmap_high() uses non-irq-safe
> locks, and it does that whole cross-cpu flushing thing (which is why
> those locks _have_ to be non-irq-safe).
>
> The way to fix that, though, would be to never do any cross-cpu calls, and
> instead just have a cpumask saying "you need to flush before you do
> anything with kmap". So you'd just set that cpumask inside the lock, and
> if/when some other CPU does a kmap, they'd flush their local TLB at _that_
> point instead of having to have an IPI call.

The idea seems nice but isn't the problem that kmap gives back a
basically 1st class kernel virtual memory? (ie. it can then be used
by any other CPU at any point without it having to use kmap?).


Nick Piggin

unread,

Jun 9, 2009, 12:50:13 PM6/9/09

to

On Tue, Jun 09, 2009 at 09:26:47AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 9 Jun 2009, Nick Piggin wrote:
> >
> > The idea seems nice but isn't the problem that kmap gives back a
> > basically 1st class kernel virtual memory? (ie. it can then be used
> > by any other CPU at any point without it having to use kmap?).
>
> No, everybody has to use kmap()/kunmap().

So it is strictly a bug to expose a pointer returned by kmap to
another CPU? That would make it easier, although it would need
to remove the global bit I think so when one task migrates CPUs
then the entry will be flushed and reloaded properly.

> The "problem" is that you could in theory run out of kmap frames, since if
> everybody does a kmap() in an interruptible context and you have lots and
> lots of threads doing different pages, you'd run out. But that has nothing
> to do with kmap_atomic(), which is basically limited to just the number of
> CPU's and a (very small) level of nesting.

This could be avoided with an anti-deadlock pool. If a task
attempts a nested kmap and already holds a kmap, then give it
exclusive access to this pool until it releases its last
nested kmap.

Linus Torvalds

unread,

Jun 9, 2009, 1:20:10 PM6/9/09

to

On Tue, 9 Jun 2009, Nick Piggin wrote:

> On Tue, Jun 09, 2009 at 09:26:47AM -0700, Linus Torvalds wrote:
> >
> >
> > On Tue, 9 Jun 2009, Nick Piggin wrote:
> > >
> > > The idea seems nice but isn't the problem that kmap gives back a
> > > basically 1st class kernel virtual memory? (ie. it can then be used
> > > by any other CPU at any point without it having to use kmap?).
> >
> > No, everybody has to use kmap()/kunmap().
>
> So it is strictly a bug to expose a pointer returned by kmap to
> another CPU?

No, not at all. The pointers are all global. They have to be, since the
original kmap() user may well be scheduled away.

> > The "problem" is that you could in theory run out of kmap frames, since if
> > everybody does a kmap() in an interruptible context and you have lots and
> > lots of threads doing different pages, you'd run out. But that has nothing
> > to do with kmap_atomic(), which is basically limited to just the number of
> > CPU's and a (very small) level of nesting.
>
> This could be avoided with an anti-deadlock pool. If a task
> attempts a nested kmap and already holds a kmap, then give it
> exclusive access to this pool until it releases its last
> nested kmap.

We just sleep, waiting for somebody to release theirs. Again, that
obviously won't work in atomic context, but it's easy enough to just have
a "we need to have a few entries free" for the atomic case, and make it
busy-loop if it runs out (which is not going to happen in practice
anyway).

Linus

Linus Torvalds

unread,

Jun 9, 2009, 2:10:09 PM6/9/09

to

On Tue, 9 Jun 2009, Linus Torvalds wrote:
>
> And they'd be even less common if the whole "64-bit kernel even if you do
> a 32-bit distro" was more common.

Side note: intel is to blame too. I think several Atom versions were
shipped with 64-bit mode disabled. So even "modern" CPU's are sometimes
artificially crippled to just 32-bit mode.

Linus

Linus Torvalds

unread,

Jun 9, 2009, 2:10:14 PM6/9/09

to

On Tue, 9 Jun 2009, H. Peter Anvin wrote:
>
> A major problem is that distros don't seem to be willing to push 64-bit
> kernels for 32-bit distros. There are a number of good (and
> not-so-good) reasons why users may want to run a 32-bit userspace, but
> not running a 64-bit kernel on capable hardware is just problematic.

Yeah, that's just stupid. A 64-bit kernel should work well with 32-bit
tools, and while we've occasionally had compat issues (the intel gfx
people used to claim that they needed to work with a 32-bit kernel because
they cared about 32-bit tools), they aren't unfixable or even all _that_
common.

And they'd be even less common if the whole "64-bit kernel even if you do
a 32-bit distro" was more common.

The nice thing about a 64-bit kernel is that you should be able to build
one even if you don't in general have all the 64-bit libraries. So you
don't need a full 64-bit development environment, you just need a compiler
that can generate code for both (and that should be the default on x86
these days).

Linus

H. Peter Anvin

unread,

Jun 9, 2009, 2:10:12 PM6/9/09

to

Ingo Molnar wrote:
>
> OTOH, highmem is clearly a useful hardware enablement feature with a
> slowly receding upside and a constant downside. The outcome is
> clear: when a critical threshold is reached distros will stop
> enabling it. (or more likely, there will be pure 64-bit x86 distros)
>

A major problem is that distros don't seem to be willing to push 64-bit
kernels for 32-bit distros. There are a number of good (and
not-so-good) reasons why users may want to run a 32-bit userspace, but
not running a 64-bit kernel on capable hardware is just problematic.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Matthew Garrett

unread,

Jun 9, 2009, 7:00:12 PM6/9/09

to

On Tue, Jun 09, 2009 at 11:07:41AM -0700, Linus Torvalds wrote:

> Side note: intel is to blame too. I think several Atom versions were
> shipped with 64-bit mode disabled. So even "modern" CPU's are sometimes
> artificially crippled to just 32-bit mode.

And some people still want to run dosemu so they can drive their
godforsaken 80s era PIO driven data analyzer. It'd be nice to think that
nobody used vm86, but they always seem to pop out of the woodwork
whenever someone suggests 64-bit kernels by default.

--
Matthew Garrett | mj...@srcf.ucam.org

H. Peter Anvin

unread,

Jun 9, 2009, 7:10:05 PM6/9/09

to

Matthew Garrett wrote:
> On Tue, Jun 09, 2009 at 11:07:41AM -0700, Linus Torvalds wrote:
>
>> Side note: intel is to blame too. I think several Atom versions were
>> shipped with 64-bit mode disabled. So even "modern" CPU's are sometimes
>> artificially crippled to just 32-bit mode.
>
> And some people still want to run dosemu so they can drive their
> godforsaken 80s era PIO driven data analyzer. It'd be nice to think that
> nobody used vm86, but they always seem to pop out of the woodwork
> whenever someone suggests 64-bit kernels by default.
>

There is both KVM and Qemu as alternatives, though. The godforsaken
80s-era PIO driven data analyzer will run fine in Qemu even on
non-HVM-capable hardware if it's 64-bit capable. Most of the time it'll
spend sitting in PIO no matter what you do.

-hpa

Paul Mackerras

unread,

Jun 9, 2009, 8:10:06 PM6/9/09

to

Ingo Molnar writes:

> I did this fork overhead measurement some time ago, using
> perfcounters and 'perf':

Could you post the program? I'd like to try it on some systems here.

Paul.

Ingo Molnar

unread,

Jun 9, 2009, 9:30:11 PM6/9/09

to

* Paul Mackerras <pau...@samba.org> wrote:

> Ingo Molnar writes:
>
> > I did this fork overhead measurement some time ago, using
> > perfcounters and 'perf':
>
> Could you post the program? I'd like to try it on some systems
> here.

I still have it, it was something really, really simple and silly:

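#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int i;

    /* Every process forks on each pass, so 2^8 = 256 processes result. */
    for (i = 0; i < 8; i++)
        if (!fork())
            wait(0);    /* child: no children yet, returns at once */

    return 0;
}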

Ingo

Nick Piggin

unread,

Jun 10, 2009, 2:00:17 AM6/10/09

to

On Tue, Jun 09, 2009 at 10:08:53AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 9 Jun 2009, Nick Piggin wrote:
>
> > On Tue, Jun 09, 2009 at 09:26:47AM -0700, Linus Torvalds wrote:
> > >
> > >
> > > On Tue, 9 Jun 2009, Nick Piggin wrote:
> > > >
> > > > The idea seems nice but isn't the problem that kmap gives back a
> > > > basically 1st class kernel virtual memory? (ie. it can then be used
> > > > by any other CPU at any point without it having to use kmap?).
> > >
> > > No, everybody has to use kmap()/kunmap().
> >
> > So it is strictly a bug to expose a pointer returned by kmap to
> > another CPU?
>
> No, not at all. The pointers are all global. They have to be, since the
> original kmap() user may well be scheduled away.

Sorry, I meant another task.

> > > The "problem" is that you could in theory run out of kmap frames, since if
> > > everybody does a kmap() in an interruptible context and you have lots and
> > > lots of threads doing different pages, you'd run out. But that has nothing
> > > to do with kmap_atomic(), which is basically limited to just the number of
> > > CPU's and a (very small) level of nesting.
> >
> > This could be avoided with an anti-deadlock pool. If a task
> > attempts a nested kmap and already holds a kmap, then give it
> > exclusive access to this pool until it releases its last
> > nested kmap.
>
> We just sleep, waiting for somebody to release theirs. Again, that
> obviously won't work in atomic context, but it's easy enough to just have
> a "we need to have a few entries free" for the atomic case, and make it
> busy-loop if it runs out (which is not going to happen in practice
> anyway).

The really theoretical one (which Andrew likes complaining about) is
when *everybody* is holding a kmap and asking for another one ;)
But I think it isn't too hard to make a pool for that. And yes we'd
also need a pool for atomic kmaps as you point out.

Peter Zijlstra

unread,

Jun 10, 2009, 2:30:09 AM6/10/09

to

On Tue, 2009-06-09 at 09:26 -0700, Linus Torvalds wrote:
>
> On Tue, 9 Jun 2009, Nick Piggin wrote:
> >
> > The idea seems nice but isn't the problem that kmap gives back a
> > basically 1st class kernel virtual memory? (ie. it can then be used
> > by any other CPU at any point without it having to use kmap?).
>
> No, everybody has to use kmap()/kunmap().
>
> The "problem" is that you could in theory run out of kmap frames, since if
> everybody does a kmap() in an interruptible context and you have lots and
> lots of threads doing different pages, you'd run out. But that has nothing
> to do with kmap_atomic(), which is basically limited to just the number of
> CPU's and a (very small) level of nesting.

One of the things I did for -rt back when I rewrote mm/highmem.c for it
was to reserve multiple slots per kmap() user so that if you did 1 you
could always do another.

With everything in task context, like rt does, 2 seemed enough, but you
could always extend that scheme and reserve enough for the worst case
nesting depth and be done with it.
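
A rough user-space sketch of such a reservation scheme, not the actual -rt code: the outermost kmap of a user reserves `NEST_DEPTH` slots at once, so a nested kmap can never block on the shared pool. `NEST_DEPTH`, `struct kmap_pool`, `struct kmap_user`, `kmap_get()` and `kmap_put()` are hypothetical names, and "would have to sleep" is modelled as returning 0.

```c
#include <assert.h>

#define NEST_DEPTH 2    /* slots reserved per user, as in the description above */

struct kmap_pool {
    unsigned free_slots;    /* slots not reserved by anyone */
};

struct kmap_user {
    unsigned reserved;      /* slots this user has reserved */
    unsigned in_use;        /* slots this user currently maps */
};

/* Returns 1 on success. Returns 0 when the outermost kmap would have
 * to wait for slots, or when nesting exceeds the reserved depth. */
static int kmap_get(struct kmap_pool *p, struct kmap_user *u)
{
    if (u->in_use == 0) {
        /* Outermost kmap: the user holds nothing, so waiting is safe. */
        if (p->free_slots < NEST_DEPTH)
            return 0;
        p->free_slots -= NEST_DEPTH;
        u->reserved = NEST_DEPTH;
    }
    if (u->in_use == u->reserved)
        return 0;           /* deeper nesting than this sketch reserves */
    u->in_use++;
    return 1;
}

/* Drop one mapping; the reservation returns to the pool with the last one. */
static void kmap_put(struct kmap_pool *p, struct kmap_user *u)
{
    assert(u->in_use > 0);
    if (--u->in_use == 0) {
        p->free_slots += u->reserved;
        u->reserved = 0;
    }
}
```

Because a user only ever waits while holding no slot, the "everybody holds one and wants another" deadlock cannot arise up to the reserved depth.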

Pavel Machek

unread,

Jun 17, 2009, 5:50:13 AM6/17/09

to

Hi!

> > > > The "problem" is that you could in theory run out of kmap frames, since if
> > > > everybody does a kmap() in an interruptible context and you have lots and
> > > > lots of threads doing different pages, you'd run out. But that has nothing
> > > > to do with kmap_atomic(), which is basically limited to just the number of
> > > > CPU's and a (very small) level of nesting.
> > >
> > > This could be avoided with an anti-deadlock pool. If a task
> > > attempts a nested kmap and already holds a kmap, then give it
> > > exclusive access to this pool until it releases its last
> > > nested kmap.
> >
> > We just sleep, waiting for somebody to release theirs. Again, that
> > obviously won't work in atomic context, but it's easy enough to just have
> > a "we need to have a few entries free" for the atomic case, and make it
> > busy-loop if it runs out (which is not going to happen in practice
> > anyway).
>
> The really theoretical one (which Andrew likes complaining about) is
> when *everybody* is holding a kmap and asking for another one ;)
> But I think it isn't too hard to make a pool for that. And yes we'd

Does one pool help?

Now you can have '*everyone* is holding the kmaps and is asking for
another one'.

You could add as many pools as maximum nesting level... Is there
maximum nesting level?

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Nick Piggin

unread,

Jun 17, 2009, 6:00:13 AM6/17/09

to

On Wed, Jun 17, 2009 at 11:40:16AM +0200, Pavel Machek wrote:
> Hi!
>
> > > > > The "problem" is that you could in theory run out of kmap frames, since if
> > > > > everybody does a kmap() in an interruptible context and you have lots and
> > > > > lots of threads doing different pages, you'd run out. But that has nothing
> > > > > to do with kmap_atomic(), which is basically limited to just the number of
> > > > > CPU's and a (very small) level of nesting.
> > > >
> > > > This could be avoided with an anti-deadlock pool. If a task
> > > > attempts a nested kmap and already holds a kmap, then give it
> > > > exclusive access to this pool until it releases its last
> > > > nested kmap.
> > >
> > > We just sleep, waiting for somebody to release theirs. Again, that
> > > obviously won't work in atomic context, but it's easy enough to just have
> > > a "we need to have a few entries free" for the atomic case, and make it
> > > busy-loop if it runs out (which is not going to happen in practice
> > > anyway).
> >
> > The really theoretical one (which Andrew likes complaining about) is
> > when *everybody* is holding a kmap and asking for another one ;)
> > But I think it isn't too hard to make a pool for that. And yes we'd
>
> Does one pool help?

So long as only one process is allowed access to the pool at
one time, yes I think it solves it. It would probably never
even hit in practice, so synchronization overhead would not
matter.

> Now you can have '*everyone* is holding the kmaps and is asking for
> another one'.
>
> You could add as many pools as maximum nesting level... Is there
> maximum nesting level?

Yes there are only a set number of kmap_atomic nesting levels,
so if you converted them all to kmap then it would be that + 1.