This version of the patch set is rebased on the 3.5-rc2 kernel. It does some code
cleanup and reorders the patches to make the sequence more logical.
It also includes the Xen code change suggested by Jan Beulich, and updates
the comments on the TLB flushing IPI according to PeterZ's suggestions.
It also adds flush_tlb_kernel_range support via invlpg.
Thanks for all the comments! More are appreciated. :)
Alex Shi
[PATCH v8 1/8] x86/tlb_info: get last level TLB entry number of CPU
[PATCH v8 2/8] x86/flush_tlb: try flush_tlb_single one by one in
[PATCH v8 3/8] x86/tlb: fall back to flush all when meet a THP large
[PATCH v8 4/8] x86/tlb: add tlb_flushall_shift for specific CPU
[PATCH v8 5/8] x86/tlb: add tlb_flushall_shift knob into debugfs
[PATCH v8 6/8] x86/tlb: enable tlb flush range support for generic
[PATCH v8 7/8] x86/tlb: replace INVALIDATE_TLB_VECTOR by
[PATCH v8 8/8] x86/tlb: do flush_tlb_kernel_range by 'invlpg'
For 4KB pages, x86 CPUs have a one- or two-level TLB: the first level consists
of a data TLB and an instruction TLB, and the second level is a TLB shared by
data and instructions.
For huge pages, there is usually just one TLB level, separated into 2MB/4MB
and 1GB entries.
Although the size of each TLB level matters for performance tuning, for
general, rough optimization the last level TLB entry number is sufficient. And
in fact, the last level TLB always has the largest number of entries.
This patch gets the largest TLB entry number for use in future TLB
optimizations.
Following Borislav's suggestion, all functions and data other than the
tlb_ll[i/d]_* arrays are released after system boot.
To be friendly to all x86 vendors, vendor-specific code was moved to the
vendor-specific files.
x86 has no instruction-level support for flush_tlb_range. Currently
flush_tlb_range is implemented by flushing the whole TLB. That is not
the best solution for all scenarios. In fact, if we use 'invlpg' to
flush just a few entries from the TLB, later accesses to the remaining
TLB entries still hit, and that is a performance gain.
But the 'invlpg' instruction is expensive. Its execution time is
comparable to a cr3 rewrite, and even a bit higher on SNB CPUs.
So, on a CPU with 512 4KB TLB entries, the balance point is at:
    (512 - X) * 100ns (assumed TLB refill cost) =
    X (TLB entries to flush) * 100ns (assumed invlpg cost)
Here, X is 256, that is, 1/2 of the 512 entries.
But with the mysterious CPU prefetcher and page miss handler unit, the
real TLB refill cost is far lower than 100ns for sequential access. And
2 HT siblings in one core make memory access even faster if they are
accessing the same memory. So, in this patch, I only make the change when
the number of target entries is less than 1/16 of all active TLB entries.
Actually, I have no data to support the '1/16' ratio, so any
suggestions are welcome.
As for hugetlb, I guess that due to the smaller page table and fewer active
TLB entries, I didn't see any benefit in my benchmark, so there is no
optimization for it now.
My macro benchmark shows that in the ideal scenario, read performance improves
by 70 percent. In the worst scenario, read/write performance is similar to
the unpatched 3.4-rc4 kernel.
Here is the reading data on my 2P * 4cores *HT NHM EP machine, with THP
'always':
static inline void flush_tlb_others(const struct cpumask *cpumask,
struct mm_struct *mm,
- unsigned long va)
+ unsigned long start,
+ unsigned long end)
{
- PVOP_VCALL3(pv_mmu_ops.flush_tlb_others, cpumask, mm, va);
+ PVOP_VCALL4(pv_mmu_ops.flush_tlb_others, cpumask, mm, start, end);
}
static inline int paravirt_pgd_alloc(struct mm_struct *mm)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8e8b9a4..600a5fcac9 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -250,7 +250,8 @@ struct pv_mmu_ops {
void (*flush_tlb_single)(unsigned long addr);
void (*flush_tlb_others)(const struct cpumask *cpus,
struct mm_struct *mm,
- unsigned long va);
+ unsigned long start,
+ unsigned long end);
/* Hooks for allocating and freeing a pagetable top-level */
int (*pgd_alloc)(struct mm_struct *mm);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 36a1a2a..33608d9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -73,14 +73,10 @@ static inline void __flush_tlb_one(unsigned long addr)
* - flush_tlb_page(vma, vmaddr) flushes one page
* - flush_tlb_range(vma, start, end) flushes a range of pages
* - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
- * - flush_tlb_others(cpumask, mm, va) flushes TLBs on other cpus
+ * - flush_tlb_others(cpumask, mm, start, end) flushes TLBs on other cpus
*
* ..but the i386 has somewhat limited tlb flushing capabilities,
* and page-granular flushes are available only on i486 and up.
- *
- * x86-64 can only flush individual pages or full VMs. For a range flush
- * we always do the full VM. Might be worth trying if for a small
- * range a few INVLPGs in a row are a win.
*/
void native_flush_tlb_others(const struct cpumask *cpumask,
- struct mm_struct *mm, unsigned long va)
+ struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
if (is_uv_system()) {
unsigned int cpu;
cpu = smp_processor_id();
- cpumask = uv_flush_tlb_others(cpumask, mm, va, cpu);
+ cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
if (cpumask)
- flush_tlb_others_ipi(cpumask, mm, va);
+ flush_tlb_others_ipi(cpumask, mm, start, end);
return;
}
- flush_tlb_others_ipi(cpumask, mm, va);
+ flush_tlb_others_ipi(cpumask, mm, start, end);
}
...
We don't need to flush large pages in PAGE_SIZE steps; that just wastes
time. And actually, large pages don't need the 'invlpg' optimization according
to our macro benchmark. So just flushing the whole TLB is enough for them.
The following results were measured on a 2CPU * 4cores * 2HT NHM EP machine
with the THP 'always' setting.
Multi-thread testing; the '-t' parameter is the thread number:
                    without this patch    with this patch
/mprotect -t 1           14ns                  13ns
/mprotect -t 2           13ns                  13ns
/mprotect -t 4           12ns                  11ns
/mprotect -t 8           14ns                  10ns
/mprotect -t 16          28ns                  28ns
/mprotect -t 32          54ns                  52ns
/mprotect -t 128         200ns                 200ns
Signed-off-by: Alex Shi <alex....@intel.com>
---
arch/x86/mm/tlb.c | 34 ++++++++++++++++++++++++++++++++++
1 files changed, 34 insertions(+), 0 deletions(-)
Testing shows that different CPU types (microarchitectures and NUMA modes) have
different balance points between a full TLB flush and multiple invlpg.
There are also cases where the TLB flush change doesn't help at all.
This patch provides an interface that lets x86 vendor developers
set a different shift for each CPU type.
For example, on the machines I have at hand, the balance point is 16 entries
on Romley-EP, 8 entries on Bloomfield NHM-EP, and 256 on an IVB mobile CPU,
but on a model 15 Core2 Xeon using invlpg doesn't help at all.
For untested machines, do a conservative optimization, the same as for NHM CPUs.
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 39b2bd4..d048cad 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -72,6 +72,8 @@ extern u16 __read_mostly tlb_lli_4m[NR_INFO];
extern u16 __read_mostly tlb_lld_4k[NR_INFO];
extern u16 __read_mostly tlb_lld_2m[NR_INFO];
extern u16 __read_mostly tlb_lld_4m[NR_INFO];
+extern s8 __read_mostly tlb_flushall_shift;
+
/*
* CPU type and hardware bug flags. Kept separately for each CPU.
* Members of this structure are referenced in head.S, so think twice
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 41acb77..39a6b01 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -459,16 +459,26 @@ u16 __read_mostly tlb_lld_4k[NR_INFO];
u16 __read_mostly tlb_lld_2m[NR_INFO];
u16 __read_mostly tlb_lld_4m[NR_INFO];
+/*
+ * tlb_flushall_shift shows the balance point in replacing cr3 write
+ * with multiple 'invlpg'. It will do this replacement when
+ * flush_tlb_lines <= active_lines/2^tlb_flushall_shift.
+ * If tlb_flushall_shift is -1, means the replacement will be disabled.
+ */
+s8 __read_mostly tlb_flushall_shift = -1;
+
void __cpuinit cpu_detect_tlb(struct cpuinfo_x86 *c)
{
if (c->x86_vendor == X86_VENDOR_INTEL)
intel_cpu_detect_tlb(c);
The kernel will replace the cr3 rewrite with invlpg when
    tlb_flush_entries <= active_tlb_entries / 2^tlb_flushall_shift
If tlb_flushall_shift is -1, the kernel won't do this replacement.
Users can modify its value according to the specific CPU/application.
Thanks to Borislav for providing the help message of
CONFIG_DEBUG_TLBFLUSH.
Signed-off-by: Alex Shi <alex....@intel.com>
---
arch/x86/Kconfig.debug | 19 +++++++++++++++++
arch/x86/mm/tlb.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+), 0 deletions(-)
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index e46c214..b322f12 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -129,6 +129,25 @@ config DOUBLEFAULT
option saves about 4k and might cause you much additional grey
hair.
+config DEBUG_TLBFLUSH
+ bool "Set upper limit of TLB entries to flush one-by-one"
+ depends on DEBUG_KERNEL && (X86_64 || X86_INVLPG)
+ ---help---
+
+ X86-only for now.
+
+ This option allows the user to tune the amount of TLB entries the
+ kernel flushes one-by-one instead of doing a full TLB flush. In
+ certain situations, the former is cheaper. This is controlled by the
+ tlb_flushall_shift knob under /sys/kernel/debug/x86. If you set it
+ to -1, the code flushes the whole TLB unconditionally. Otherwise,
+ for positive values of it, the kernel will use single TLB entry
+ invalidating instructions according to the following formula:
+
+ flush_entries <= active_tlb_entries / 2^tlb_flushall_shift
+
+ If in doubt, say "N".
+
config IOMMU_DEBUG
bool "Enable IOMMU debugging"
depends on GART_IOMMU && DEBUG_KERNEL
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2939f2f..5911f61 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
#include <asm/cache.h>
#include <asm/apic.h>
#include <asm/uv/uv.h>
+#include <linux/debugfs.h>
Not every tlb_flush call really needs to evacuate all TLB entries.
In munmap, for example, a few 'invlpg' are better for overall
process performance, since they leave most TLB entries in place for later
accesses.
All of the tlb interfaces in mm/memory.c are shared by all architectures,
except for a few that belong to the generic mmu_gather and are protected by
HAVE_GENERIC_MMU_GATHER. So, add the flush range start/end under
HAVE_GENERIC_MMU_GATHER for the generic mmu implementation.
For x86, we just need to re-implement tlb_flush().
This patch also rewrites flush_tlb_range for 2 purposes:
1, split it out to get the flush_tlb_mm_range function.
2, clean up to reduce line breaks, thanks to Borislav's input.
My macro benchmark 'mummap' (http://lkml.org/lkml/2012/5/17/59) shows that
random memory access on other CPUs gets a 0~50% speedup
on a 2P * 4cores * HT NHM EP while doing 'munmap'.
Thanks to Peter Zijlstra for reminding me time and again to keep the code
safe for multiple architectures!
This patch does flush_tlb_kernel_range by 'invlpg'. The performance cost
and gain were analyzed in my earlier patch (x86/flush_tlb: try flush_tlb_single
one by one in flush_tlb_range). Now we move this logic into the kernel
part. The cost, multiple 'invlpg' executions, is the same, but
the gain (reduced cost of refilling TLB entries) is definitely
larger.
Signed-off-by: Alex Shi <alex....@intel.com>
---
arch/x86/include/asm/tlbflush.h | 13 +++++++------
arch/x86/mm/tlb.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+), 6 deletions(-)
There are only 32 INVALIDATE_TLB_VECTORs in the kernel now,
but modern x86 servers have more CPUs than that. This causes heavy
lock contention in TLB flushing.
So use the generic smp call function to replace them.
On an NHM EX machine with 4P * 8cores * HT = 64 CPUs, hackbench pthread
shows a 3% performance increase.
There is no clear performance change on NHM EP (16 CPUs), WSM EP (24 CPUs)
and SNB EP (32 CPUs) machines.
- /* IPIs for invalidation */
-#define ALLOC_INVTLB_VEC(NR) \
- alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+NR, \
- invalidate_interrupt##NR)
-
- switch (NUM_INVALIDATE_TLB_VECTORS) {
- default:
- ALLOC_INVTLB_VEC(31);
- case 31:
- ALLOC_INVTLB_VEC(30);
- case 30:
- ALLOC_INVTLB_VEC(29);
- case 29:
- ALLOC_INVTLB_VEC(28);
- case 28:
- ALLOC_INVTLB_VEC(27);
- case 27:
- ALLOC_INVTLB_VEC(26);
- case 26:
- ALLOC_INVTLB_VEC(25);
- case 25:
- ALLOC_INVTLB_VEC(24);
- case 24:
- ALLOC_INVTLB_VEC(23);
- case 23:
- ALLOC_INVTLB_VEC(22);
- case 22:
- ALLOC_INVTLB_VEC(21);
- case 21:
- ALLOC_INVTLB_VEC(20);
- case 20:
- ALLOC_INVTLB_VEC(19);
- case 19:
- ALLOC_INVTLB_VEC(18);
- case 18:
- ALLOC_INVTLB_VEC(17);
- case 17:
- ALLOC_INVTLB_VEC(16);
- case 16:
- ALLOC_INVTLB_VEC(15);
- case 15:
- ALLOC_INVTLB_VEC(14);
- case 14:
- ALLOC_INVTLB_VEC(13);
- case 13:
- ALLOC_INVTLB_VEC(12);
- case 12:
- ALLOC_INVTLB_VEC(11);
- case 11:
- ALLOC_INVTLB_VEC(10);
- case 10:
- ALLOC_INVTLB_VEC(9);
- case 9:
- ALLOC_INVTLB_VEC(8);
- case 8:
- ALLOC_INVTLB_VEC(7);
- case 7:
- ALLOC_INVTLB_VEC(6);
- case 6:
- ALLOC_INVTLB_VEC(5);
- case 5:
- ALLOC_INVTLB_VEC(4);
- case 4:
- ALLOC_INVTLB_VEC(3);
- case 3:
- ALLOC_INVTLB_VEC(2);
- case 2:
- ALLOC_INVTLB_VEC(1);
- case 1:
- ALLOC_INVTLB_VEC(0);
- break;
- }
-
/* IPI for generic function call */
alloc_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 481737d..2b5f506 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,34 +28,14 @@ DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate)
*
* More scalable flush, from Andi Kleen
*
- * To avoid global state use 8 different call vectors.
- * Each CPU uses a specific vector to trigger flushes on other
- * CPUs. Depending on the received vector the target CPUs look into
- * the right array slot for the flush data.
- *
- * With more than 8 CPUs they are hashed to the 8 available
- * vectors. The limited global vector space forces us to this right now.
- * In future when interrupts are split into per CPU domains this could be
- * fixed, at the cost of triggering multiple IPIs in some cases.
+ * Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
*/
-union smp_flush_state {
- struct {
- struct mm_struct *flush_mm;
- unsigned long flush_start;
- unsigned long flush_end;
- raw_spinlock_t tlbstate_lock;
- DECLARE_BITMAP(flush_cpumask, NR_CPUS);
- };
- char pad[INTERNODE_CACHE_BYTES];
-} ____cacheline_internodealigned_in_smp;
-
-/* State is put into the per CPU data section, but padded
- to a full cache line because other CPUs can access it and we don't
- want false sharing in the per cpu data segment. */
-static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
-
-static DEFINE_PER_CPU_READ_MOSTLY(int, tlb_vector_offset);
+struct flush_tlb_info {
+ struct mm_struct *flush_mm;
+ unsigned long flush_start;
+ unsigned long flush_end;
+};
/*
* We cannot call mmdrop() because we are in interrupt context,
@@ -74,28 +54,25 @@ void leave_mm(int cpu)
EXPORT_SYMBOL_GPL(leave_mm);
/*
- *
* The flush IPI assumes that a thread switch happens in this order:
* [cpu0: the cpu that switches]
* 1) switch_mm() either 1a) or 1b)
* 1a) thread switch to a different mm
- * 1a1) cpu_clear(cpu, old_mm->cpu_vm_mask);
- * Stop ipi delivery for the old mm. This is not synchronized with
- * the other cpus, but smp_invalidate_interrupt ignore flush ipis
- * for the wrong mm, and in the worst case we perform a superfluous
- * tlb flush.
- * 1a2) set cpu mmu_state to TLBSTATE_OK
- * Now the smp_invalidate_interrupt won't call leave_mm if cpu0
- * was in lazy tlb mode.
- * 1a3) update cpu active_mm
+ * 1a1) set cpu_tlbstate to TLBSTATE_OK
+ * Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
+ * if cpu0 was in lazy tlb mode.
+ * 1a2) update cpu active_mm
* Now cpu0 accepts tlb flushes for the new mm.
- * 1a4) cpu_set(cpu, new_mm->cpu_vm_mask);
+ * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
* Now the other cpus will send tlb flush ipis.
* 1a4) change cr3.
+ * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
+ * Stop ipi delivery for the old mm. This is not synchronized with
+ * the other cpus, but flush_tlb_func ignore flush ipis for the wrong
+ * mm, and in the worst case we perform a superfluous tlb flush.
* 1b) thread switch without mm change
- * cpu active_mm is correct, cpu0 already handles
- * flush ipis.
- * 1b1) set cpu mmu_state to TLBSTATE_OK
+ * cpu active_mm is correct, cpu0 already handles flush ipis.
+ * 1b1) set cpu_tlbstate to TLBSTATE_OK
* 1b2) test_and_set the cpu bit in cpu_vm_mask.
* Atomically set the bit [other cpus will start sending flush ipis],
* and test the bit.
@@ -108,186 +85,61 @@ EXPORT_SYMBOL_GPL(leave_mm);
* runs in kernel space, the cpu could load tlb entries for user space
* pages.
*
- * The good news is that cpu mmu_state is local to each cpu, no
+ * The good news is that cpu_tlbstate is local to each cpu, no
* write/read ordering problems.
*/
/*
- * TLB flush IPI:
- *
+ * TLB flush function:
* 1) Flush the tlb entries if the cpu uses the mm that's being flushed.
* 2) Leave the mm if we are in the lazy tlb mode.
- *
- * Interrupts are disabled.
- */
-
-/*
- * FIXME: use of asmlinkage is not consistent. On x86_64 it's noop
- * but still used for documentation purpose but the usage is slightly
- * inconsistent. On x86_32, asmlinkage is regparm(0) but interrupt
- * entry calls in with the first parameter in %eax. Maybe define
- * intrlinkage?
*/
-#ifdef CONFIG_X86_64
-asmlinkage
-#endif
-void smp_invalidate_interrupt(struct pt_regs *regs)
+static void flush_tlb_func(void *info)
{
- unsigned int cpu;
- unsigned int sender;
- union smp_flush_state *f;
-
- cpu = smp_processor_id();
- /*
- * orig_rax contains the negated interrupt vector.
- * Use that to determine where the sender put the data.
- */
- sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
- f = &flush_state[sender];
-
- if (!cpumask_test_cpu(cpu, to_cpumask(f->flush_cpumask)))
- goto out;
- /*
- * This was a BUG() but until someone can quote me the
- * line from the intel manual that guarantees an IPI to
...
Any comments on this patch set? Especially on the 8th patch?
On Tue, Jun 12, 2012 at 05:06:45PM +0800, Alex Shi wrote:
> This patch do flush_tlb_kernel_range by 'invlpg'. The performance pay
> and gain was analyzed in my patch (x86/flush_tlb: try flush_tlb_single
> one by one in flush_tlb_range). Now we move this logical into kernel
> part. The pay is multiple 'invlpg' execution cost, that is same. but
> the gain(cost reducing of TLB entries refilling) is absoulutely
> increased.
The subtle point is whether INVLPG flushes global pages or not.
After some digging I found a sentence in the SDM that says it does.
So it may be safe.
Many thanks for your time!
> What does it improve?
I don't have a specific benchmark for this, partly because the theory behind
the gain was already proven: it is the same as the earlier flush of a user
process's page table.
The user of the kernel TLB flush in the kernel is vmalloc, and the Android
binder IPC subsystem uses it (drivers/staging/android/binder.c).
I am wondering if this can help Android?
So, adding a cc to android-kernel@googlegroups.com
Sorry, the Android list rejects posts from unregistered senders, so cc'ing
linux-o...@vger.kernel.org and linux-te...@vger.kernel.org instead.
Oops, I forgot the architecture difference again.
This will help x86 Android, not ARM systems. Please ignore the above 2 mailing lists. :(
>>> What does it improve?
I just wrote a rough kernel module that allocates some page arrays in the kernel and then maps them to a vaddr with 'vmap'.
Then my macro benchmark injects an 'unmap_kernel_range' request through a sysfs interface while doing random memory accesses at user level.
On my NHM EP 2P * 4 Cores * HT:
Without this patch, memory access with 4 threads takes ~12ns per access.
With this patch, memory access with 4 threads takes ~9ns per access.
As the thread number increases, the benefit shrinks and nearly disappears once the thread number reaches 256.
But there is no regression.
The rough user macro-benchmark and kernel module are here:
/*
maccess.c
This is a macrobenchmark for TLB flush range testing.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
Copyright (C) Intel 2012
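Copyright Alex Shi alex....@intel.com
*/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>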
//data for threads
struct data{
int pagenum;
void *startaddr;
int rw;
int loop;
};
volatile int * threadstart;
//thread for memory accessing
void *accessmm(void *data){
	struct data *d = data;
	long *actimes;
	char x;
	int i, k;
	int randn[PAGE_SIZE];

	for (i=0;i<PAGE_SIZE; i++)
		randn[i] = rand();
	actimes = malloc(sizeof(long));
	while (*threadstart == 0 )
		usleep(1);
	if (d->rw == 0)
		for (*actimes=0; *threadstart == 1; (*actimes)++)
			for (k=0; k < d->pagenum; k++)
				x = *(volatile char *)(d->startaddr + randn[k]%FILE_SIZE);
	else
		for (*actimes=0; *threadstart == 1; (*actimes)++)
			for (k=0; k < d->pagenum; k++)
				*(char *)(d->startaddr + randn[k]%FILE_SIZE) = 1;
	return actimes;
}
int main(int argc, char *argv[])
{
	static char optstr[] = "p:w:ht:s:";
	int s = 1; /* */
	int p = 512; /* default accessed page number, after maccess */
	int er = 0, rw = 0, h = 0, t = 2; /* d: debug; h: use huge page; t thread number */
	int pagesize = PAGE_SIZE; /*default for regular page */
	volatile char x;
	long protindex = 0;
	int i, j, k, c;
	void *m1, *startaddr;
	unsigned long *startaddr2[1024*512];
	volatile void *tempaddr;
	clockid_t clockid = CLOCK_MONOTONIC;
	unsigned long start, stop, mptime, actime;
	int randn[PAGE_SIZE];
	pthread_t pid[1024];
	void * res;
	struct data data;
	char command[1024];

	for (i=0;i<PAGE_SIZE; i++)
		randn[i] = rand();

	while ((c = getopt(argc, argv, optstr)) != EOF)
		switch (c) {
		case 's':
			s = atoi(optarg);
			break;
		case 'p':
			p = atoi(optarg);
			break;
		case 'h':
			h = 1;
			break;
		case 'w':
			rw = atoi(optarg);
			break;
		case 't':
			t = atoi(optarg);
			break;
		case '?':
			er = 1;
			break;
		}
	if (er) {
		printf("usage: %s %s\n", argv[0], optstr);
		exit(1);
	}
> Any comments for this patch set. Specially on the 8th patch?
If there are no more comments, would anyone like to pick up this patch set?
It helps performance and scalability on x86 machines.