
[PATCH] Enhance perf to collect KVM guest os statistics from host side


Zhang, Yanmin
Mar 16, 2010, 1:30:02 AM

From: Zhang, Yanmin <yanmin...@linux.intel.com>

Based on discussions in the KVM community, I worked out a patch that lets
perf collect guest OS statistics from the host side. The patch was
implemented with kind help from Ingo, Peter, and others. Yang Sheng pointed
out a critical bug and, along with others, provided good suggestions. I
really appreciate their help.

The patch adds a new subcommand, kvm, to perf.

perf kvm top
perf kvm record
perf kvm report
perf kvm diff

The new perf can profile the guest OS kernel, but not guest OS user space;
it can, however, summarize guest user-space utilization per guest.

Below are some examples.
1) perf kvm top
[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules top

--------------------------------------------------------------------------------------------------------------------------
PerfTop: 16010 irqs/sec kernel:59.1% us: 1.5% guest kernel:31.9% guest us: 7.5% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)
--------------------------------------------------------------------------------------------------------------------------

samples pcnt function DSO
_______ _____ _________________________ _______________________

38770.00 20.4% __ticket_spin_lock [guest.kernel.kallsyms]
22560.00 11.9% ftrace_likely_update [kernel.kallsyms]
9208.00 4.8% __lock_acquire [kernel.kallsyms]
5473.00 2.9% trace_hardirqs_off_caller [kernel.kallsyms]
5222.00 2.7% copy_user_generic_string [guest.kernel.kallsyms]
4450.00 2.3% validate_chain [kernel.kallsyms]
4262.00 2.2% trace_hardirqs_on_caller [kernel.kallsyms]
4239.00 2.2% do_raw_spin_lock [kernel.kallsyms]
3548.00 1.9% do_raw_spin_unlock [kernel.kallsyms]
2487.00 1.3% lock_release [kernel.kallsyms]
2165.00 1.1% __local_bh_disable [kernel.kallsyms]
1905.00 1.0% check_chain_key [kernel.kallsyms]
1737.00 0.9% lock_acquire [kernel.kallsyms]
1604.00 0.8% tcp_recvmsg [kernel.kallsyms]
1524.00 0.8% mark_lock [kernel.kallsyms]
1464.00 0.8% schedule [kernel.kallsyms]
1423.00 0.7% __d_lookup [guest.kernel.kallsyms]

To show only host data, simply omit the --guest parameter. The headline
includes the guest OS kernel and user-space percentages.

2) perf kvm record
[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules record -f -a sleep 60
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 29.385 MB perf.data.kvm (~1283837 samples) ]

3) perf kvm report
3.1) [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules report --sort pid --showcpuutilization>norm.host.guest.report.pid
# Samples: 2453796285126
#
# Overhead      sys       us  guest sys  guest us  Command: Pid
# ........  .......  .......  .........  ........  ................
#
    43.67%    1.35%    0.01%     39.06%     3.26%  qemu-system-x86: 3913
     3.78%    3.58%    0.20%      0.00%     0.00%  tbench:13519
     3.69%    3.66%    0.03%      0.00%     0.00%  tbench_srv:13526

Some performance engineers want perf to show sys/us/guest_sys/guest_us
utilization per KVM guest instance, which is really just a multi-threaded
process. The --showcpuutilization option used above provides exactly that
breakdown.
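
In essence the accounting is a per-mode breakdown: every sample bumps one of
four counters on its histogram entry according to the sample's cpumode. A
condensed form of the __perf_session__add_count() change in the hist.c hunk
below:

	switch (al->cpumode) {
	case PERF_RECORD_MISC_KERNEL:
		he->count_sys += count;		/* host kernel */
		break;
	case PERF_RECORD_MISC_USER:
		he->count_us += count;		/* host user space */
		break;
	case PERF_RECORD_MISC_GUEST_KERNEL:
		he->count_guest_sys += count;	/* guest kernel */
		break;
	case PERF_RECORD_MISC_GUEST_USER:
		he->count_guest_us += count;	/* guest user space */
		break;
	}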

3.2) [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules report >norm.host.guest.report
# Samples: 2466991384118
#
# Overhead Command Shared Object Symbol
# ........ ............... ........................................................................ ......
#
29.11% qemu-system-x86 [guest.kernel.kallsyms] [g] __ticket_spin_lock
5.88% tbench_srv [kernel.kallsyms] [k] ftrace_likely_update
5.76% tbench [kernel.kallsyms] [k] ftrace_likely_update
3.88% qemu-system-x86 34c3255482 [u] 0x000034c3255482
1.83% tbench [kernel.kallsyms] [k] __lock_acquire
1.81% tbench_srv [kernel.kallsyms] [k] __lock_acquire
1.38% tbench_srv [kernel.kallsyms] [k] trace_hardirqs_off_caller
1.37% tbench [kernel.kallsyms] [k] trace_hardirqs_off_caller
1.13% qemu-system-x86 [guest.kernel.kallsyms] [g] copy_user_generic_string
1.04% tbench_srv [kernel.kallsyms] [k] validate_chain
1.00% tbench [kernel.kallsyms] [k] trace_hardirqs_on_caller
1.00% tbench_srv [kernel.kallsyms] [k] trace_hardirqs_on_caller
0.95% tbench [kernel.kallsyms] [k] do_raw_spin_lock


[u] means the sample is in guest OS user space, and [g] means it is in the
guest OS kernel; the other columns are self-explanatory. A module shown as,
e.g., [ext4] is a guest kernel module, because native host kernel modules
appear with paths starting with something like /lib/modules/XXX.
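
For reference, the level letter is decoded from the new cpumode bits in each
sample's header. Condensed from the thread__find_addr_map() change in the
event.c hunk below (the real code also honors the --host/--guest filters):

	u8 cpumode = self->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;

	switch (cpumode) {
	case PERF_RECORD_MISC_KERNEL:
		al->level = 'k';	/* host kernel */
		break;
	case PERF_RECORD_MISC_USER:
		al->level = '.';	/* host user space */
		break;
	case PERF_RECORD_MISC_GUEST_KERNEL:
		al->level = 'g';	/* guest kernel */
		break;
	case PERF_RECORD_MISC_GUEST_USER:
		al->level = 'u';	/* guest user space, only counted for now */
		break;
	default:
		al->level = 'H';	/* hypervisor */
		break;
	}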


Below is the patch, against the tip/master tree of March 15th.

Signed-off-by: Zhang Yanmin <yanmin...@linux.intel.com>

---

diff -Nraup linux-2.6_tipmaster0315/arch/x86/include/asm/perf_event.h linux-2.6_tipmaster0315_perfkvm/arch/x86/include/asm/perf_event.h
--- linux-2.6_tipmaster0315/arch/x86/include/asm/perf_event.h 2010-03-16 08:59:11.533288951 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/include/asm/perf_event.h 2010-03-16 09:01:09.972117272 +0800
@@ -143,17 +143,10 @@ extern void perf_events_lapic_init(void)
*/
#define PERF_EFLAGS_EXACT (1UL << 3)

-#define perf_misc_flags(regs) \
-({ int misc = 0; \
- if (user_mode(regs)) \
- misc |= PERF_RECORD_MISC_USER; \
- else \
- misc |= PERF_RECORD_MISC_KERNEL; \
- if (regs->flags & PERF_EFLAGS_EXACT) \
- misc |= PERF_RECORD_MISC_EXACT; \
- misc; })
-
-#define perf_instruction_pointer(regs) ((regs)->ip)
+struct pt_regs;
+extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
+extern unsigned long perf_misc_flags(struct pt_regs *regs);
+#define perf_misc_flags(regs) perf_misc_flags(regs)

#else
static inline void init_hw_perf_events(void) { }
diff -Nraup linux-2.6_tipmaster0315/arch/x86/include/asm/ptrace.h linux-2.6_tipmaster0315_perfkvm/arch/x86/include/asm/ptrace.h
--- linux-2.6_tipmaster0315/arch/x86/include/asm/ptrace.h 2010-03-16 08:59:11.701271925 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/include/asm/ptrace.h 2010-03-16 09:01:09.972117272 +0800
@@ -167,6 +167,15 @@ static inline int user_mode(struct pt_re
#endif
}

+static inline int user_mode_cs(u16 cs)
+{
+#ifdef CONFIG_X86_32
+ return (cs & SEGMENT_RPL_MASK) == USER_RPL;
+#else
+ return !!(cs & 3);
+#endif
+}
+
static inline int user_mode_vm(struct pt_regs *regs)
{
#ifdef CONFIG_X86_32
diff -Nraup linux-2.6_tipmaster0315/arch/x86/kernel/cpu/perf_event.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kernel/cpu/perf_event.c
--- linux-2.6_tipmaster0315/arch/x86/kernel/cpu/perf_event.c 2010-03-16 08:59:12.225267457 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kernel/cpu/perf_event.c 2010-03-16 09:03:02.343617673 +0800
@@ -1707,3 +1707,30 @@ void perf_arch_fetch_caller_regs(struct
local_save_flags(regs->flags);
}
EXPORT_SYMBOL_GPL(perf_arch_fetch_caller_regs);
+
+unsigned long perf_instruction_pointer(struct pt_regs *regs)
+{
+ unsigned long ip;
+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ ip = perf_guest_cbs->get_guest_ip();
+ } else
+ ip = instruction_pointer(regs);
+ return ip;
+}
+
+unsigned long perf_misc_flags(struct pt_regs *regs)
+{
+ int misc = 0;
+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ misc |= perf_guest_cbs->is_user_mode() ?
+ PERF_RECORD_MISC_GUEST_USER :
+ PERF_RECORD_MISC_GUEST_KERNEL;
+ } else
+ misc |= user_mode(regs) ? PERF_RECORD_MISC_USER :
+ PERF_RECORD_MISC_KERNEL;
+ if (regs->flags & PERF_EFLAGS_EXACT)
+ misc |= PERF_RECORD_MISC_EXACT;
+
+ return misc;
+}
+
diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c 2010-03-16 08:59:11.825295404 +0800
+++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c 2010-03-16 09:01:09.976084492 +0800
@@ -26,6 +26,7 @@
#include <linux/sched.h>
#include <linux/moduleparam.h>
#include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
#include "kvm_cache_regs.h"
#include "x86.h"

@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
vmcs_write32(TPR_THRESHOLD, irr);
}

+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+ percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+ return percpu_read(kvm_in_guest);
+}
+
+static int kvm_is_user_mode(void)
+{
+ int user_mode;
+ user_mode = user_mode_cs(vmcs_read16(GUEST_CS_SELECTOR));
+ return user_mode;
+}
+
+static unsigned long kvm_get_guest_ip(void)
+{
+ return vmcs_readl(GUEST_RIP);
+}
+
+static void kvm_reset_in_guest(void)
+{
+ if (percpu_read(kvm_in_guest))
+ percpu_write(kvm_in_guest, 0);
+}
+
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+ .is_in_guest = kvm_is_in_guest,
+ .is_user_mode = kvm_is_user_mode,
+ .get_guest_ip = kvm_get_guest_ip,
+ .reset_in_guest = kvm_reset_in_guest,
+};
+
static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
{
u32 exit_intr_info;
@@ -3653,8 +3691,11 @@ static void vmx_complete_interrupts(stru

/* We need to handle NMIs before interrupts are enabled */
if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
- (exit_intr_info & INTR_INFO_VALID_MASK))
+ (exit_intr_info & INTR_INFO_VALID_MASK)) {
+ kvm_set_in_guest();
asm("int $2");
+ kvm_reset_in_guest();
+ }

idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;

@@ -4251,6 +4292,8 @@ static int __init vmx_init(void)
if (bypass_guest_pf)
kvm_mmu_set_nonpresent_ptes(~0xffeull, 0ull);

+ perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
return 0;

out3:
@@ -4266,6 +4309,8 @@ out:

static void __exit vmx_exit(void)
{
+ perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
free_page((unsigned long)vmx_msr_bitmap_legacy);
free_page((unsigned long)vmx_msr_bitmap_longmode);
free_page((unsigned long)vmx_io_bitmap_b);
diff -Nraup linux-2.6_tipmaster0315/include/linux/perf_event.h linux-2.6_tipmaster0315_perfkvm/include/linux/perf_event.h
--- linux-2.6_tipmaster0315/include/linux/perf_event.h 2010-03-16 08:59:21.940168828 +0800
+++ linux-2.6_tipmaster0315_perfkvm/include/linux/perf_event.h 2010-03-16 09:01:09.976084492 +0800
@@ -288,11 +288,13 @@ struct perf_event_mmap_page {
__u64 data_tail; /* user-space written tail */
};

-#define PERF_RECORD_MISC_CPUMODE_MASK (3 << 0)
+#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
#define PERF_RECORD_MISC_CPUMODE_UNKNOWN (0 << 0)
#define PERF_RECORD_MISC_KERNEL (1 << 0)
#define PERF_RECORD_MISC_USER (2 << 0)
#define PERF_RECORD_MISC_HYPERVISOR (3 << 0)
+#define PERF_RECORD_MISC_GUEST_KERNEL (4 << 0)
+#define PERF_RECORD_MISC_GUEST_USER (5 << 0)

#define PERF_RECORD_MISC_EXACT (1 << 14)
/*
@@ -446,6 +448,13 @@ enum perf_callchain_context {
# include <asm/perf_event.h>
#endif

+struct perf_guest_info_callbacks {
+ int (*is_in_guest) (void);
+ int (*is_user_mode) (void);
+ unsigned long (*get_guest_ip) (void);
+ void (*reset_in_guest) (void);
+};
+
#ifdef CONFIG_HAVE_HW_BREAKPOINT
#include <asm/hw_breakpoint.h>
#endif
@@ -913,6 +922,12 @@ static inline void perf_event_mmap(struc
__perf_event_mmap(vma);
}

+extern struct perf_guest_info_callbacks *perf_guest_cbs;
+extern int perf_register_guest_info_callbacks(
+ struct perf_guest_info_callbacks *);
+extern int perf_unregister_guest_info_callbacks(
+ struct perf_guest_info_callbacks *);
+
extern void perf_event_comm(struct task_struct *tsk);
extern void perf_event_fork(struct task_struct *tsk);

@@ -982,6 +997,11 @@ perf_sw_event(u32 event_id, u64 nr, int
static inline void
perf_bp_event(struct perf_event *event, void *data) { }

+static inline int perf_register_guest_info_callbacks
+(struct perf_guest_info_callbacks *cbs) { return 0; }
+static inline int perf_unregister_guest_info_callbacks
+(struct perf_guest_info_callbacks *cbs) { return 0; }
+
static inline void perf_event_mmap(struct vm_area_struct *vma) { }
static inline void perf_event_comm(struct task_struct *tsk) { }
static inline void perf_event_fork(struct task_struct *tsk) { }
diff -Nraup linux-2.6_tipmaster0315/kernel/perf_event.c linux-2.6_tipmaster0315_perfkvm/kernel/perf_event.c
--- linux-2.6_tipmaster0315/kernel/perf_event.c 2010-03-16 08:59:55.108431543 +0800
+++ linux-2.6_tipmaster0315_perfkvm/kernel/perf_event.c 2010-03-16 09:01:09.980084394 +0800
@@ -2796,6 +2796,27 @@ void perf_arch_fetch_caller_regs(struct
}

/*
+ * We assume there is only KVM supporting the callbacks.
+ * Later on, we might change it to a list if there is
+ * another virtualization implementation supporting the callbacks.
+ */
+struct perf_guest_info_callbacks *perf_guest_cbs;
+
+int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks * cbs)
+{
+ perf_guest_cbs = cbs;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
+
+int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks * cbs)
+{
+ perf_guest_cbs = NULL;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
+
+/*
* Output
*/
static bool perf_output_space(struct perf_mmap_data *data, unsigned long tail,
@@ -3738,7 +3759,7 @@ void __perf_event_mmap(struct vm_area_st
.event_id = {
.header = {
.type = PERF_RECORD_MMAP,
- .misc = 0,
+ .misc = PERF_RECORD_MISC_USER,
/* .size */
},
/* .pid */
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-diff.c linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin-diff.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-diff.c 2010-03-16 08:59:54.736473543 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin-diff.c 2010-03-16 10:13:14.620371938 +0800
@@ -33,7 +33,7 @@ static int perf_session__add_hist_entry(
return -ENOMEM;

if (hit)
- he->count += count;
+ __perf_session__add_count(he, al, count);

return 0;
}
@@ -225,6 +225,9 @@ int cmd_diff(int argc, const char **argv
input_new = argv[1];
} else
input_new = argv[0];
+ } else if (symbol_conf.guest_vmlinux_name || symbol_conf.guest_kallsyms) {
+ input_old = "perf.data.host";
+ input_new = "perf.data.guest";
}

symbol_conf.exclude_other = false;
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin.h linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin.h
--- linux-2.6_tipmaster0315/tools/perf/builtin.h 2010-03-16 08:59:54.692509868 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin.h 2010-03-16 09:01:09.980084394 +0800
@@ -32,5 +32,6 @@ extern int cmd_version(int argc, const c
extern int cmd_probe(int argc, const char **argv, const char *prefix);
extern int cmd_kmem(int argc, const char **argv, const char *prefix);
extern int cmd_lock(int argc, const char **argv, const char *prefix);
+extern int cmd_kvm(int argc, const char **argv, const char *prefix);

#endif
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-kvm.c linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin-kvm.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-kvm.c 1970-01-01 08:00:00.000000000 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin-kvm.c 2010-03-16 09:01:09.980084394 +0800
@@ -0,0 +1,123 @@
+#include "builtin.h"
+#include "perf.h"
+
+#include "util/util.h"
+#include "util/cache.h"
+#include "util/symbol.h"
+#include "util/thread.h"
+#include "util/header.h"
+#include "util/session.h"
+
+#include "util/parse-options.h"
+#include "util/trace-event.h"
+
+#include "util/debug.h"
+
+#include <sys/prctl.h>
+
+#include <semaphore.h>
+#include <pthread.h>
+#include <math.h>
+
+static char *file_name = NULL;
+static char name_buffer[256];
+
+int perf_host = 1;
+int perf_guest = 0;
+
+static const char * const kvm_usage[] = {
+ "perf kvm [<options>] {top|record|report|diff}",
+ NULL
+};
+
+static const struct option kvm_options[] = {
+ OPT_STRING('i', "input", &file_name, "file",
+ "Input file name"),
+ OPT_STRING('o', "output", &file_name, "file",
+ "Output file name"),
+ OPT_BOOLEAN(0, "guest", &perf_guest,
+ "Collect guest os data"),
+ OPT_BOOLEAN(0, "host", &perf_host,
+ "Collect guest os data"),
+ OPT_STRING(0, "guestvmlinux", &symbol_conf.guest_vmlinux_name, "file",
+ "file saving guest os vmlinux"),
+ OPT_STRING(0, "guestkallsyms", &symbol_conf.guest_kallsyms, "file",
+ "file saving guest os /proc/kallsyms"),
+ OPT_STRING(0, "guestmodules", &symbol_conf.guest_modules, "file",
+ "file saving guest os /proc/modules"),
+ OPT_END()
+};
+
+static int __cmd_record(int argc, const char **argv)
+{
+ int rec_argc, i = 0, j;
+ const char **rec_argv;
+
+ rec_argc = argc + 2;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+ rec_argv[i++] = strdup("record");
+ rec_argv[i++] = strdup("-o");
+ rec_argv[i++] = strdup(file_name);
+ for (j = 1; j < argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ BUG_ON(i != rec_argc);
+
+ return cmd_record(i, rec_argv, NULL);
+}
+
+static int __cmd_report(int argc, const char **argv)
+{
+ int rec_argc, i = 0, j;
+ const char **rec_argv;
+
+ rec_argc = argc + 2;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+ rec_argv[i++] = strdup("report");
+ rec_argv[i++] = strdup("-i");
+ rec_argv[i++] = strdup(file_name);
+ for (j = 1; j < argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ BUG_ON(i != rec_argc);
+
+ return cmd_report(i, rec_argv, NULL);
+}
+
+int cmd_kvm(int argc, const char **argv, const char *prefix __used)
+{
+ perf_host = perf_guest = 0;
+
+ argc = parse_options(argc, argv, kvm_options, kvm_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (!argc)
+ usage_with_options(kvm_usage, kvm_options);
+
+ if (!perf_host)
+ perf_guest = 1;
+
+ if (!file_name) {
+ if (perf_host && !perf_guest)
+ sprintf(name_buffer, "perf.data.host");
+ else if (!perf_host && perf_guest)
+ sprintf(name_buffer, "perf.data.guest");
+ else
+ sprintf(name_buffer, "perf.data.kvm");
+ file_name = name_buffer;
+ }
+
+ if (!strncmp(argv[0], "rec", 3)) {
+ return __cmd_record(argc, argv);
+ } else if (!strncmp(argv[0], "rep", 3)) {
+ return __cmd_report(argc, argv);
+ } else if (!strncmp(argv[0], "diff", 4)) {
+ return cmd_diff(argc, argv, NULL);
+ } else if (!strncmp(argv[0], "top", 3)) {
+ return cmd_top(argc, argv, NULL);
+ } else {
+ usage_with_options(kvm_usage, kvm_options);
+ }
+
+ return 0;
+}
+
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-record.c linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin-record.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-record.c 2010-03-16 08:59:54.896488489 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin-record.c 2010-03-16 09:01:09.980084394 +0800
@@ -566,18 +566,58 @@ static int __cmd_record(int argc, const
post_processing_offset = lseek(output, 0, SEEK_CUR);

err = event__synthesize_kernel_mmap(process_synthesized_event,
- session, "_text");
+ session, "/proc/kallsyms",
+ "kernel.kallsyms",
+ session->vmlinux_maps,
+ "_text", PERF_RECORD_MISC_KERNEL);
if (err < 0) {
pr_err("Couldn't record kernel reference relocation symbol.\n");
return err;
}

- err = event__synthesize_modules(process_synthesized_event, session);
+ err = event__synthesize_modules(process_synthesized_event,
+ session,
+ &session->kmaps,
+ PERF_RECORD_MISC_KERNEL);
if (err < 0) {
pr_err("Couldn't record kernel reference relocation symbol.\n");
return err;
}

+ if (perf_guest) {
+ /*
+ *As for guest kernel when processing subcommand record&report,
+ *we arrange module mmap prior to guest kernel mmap and trigger
+ *a preload dso because guest module symbols are loaded from guest
+ *kallsyms instead of /lib/modules/XXX/XXX. This method is used to
+ *avoid symbol missing when the first addr is in module instead of
+ *in guest kernel
+ */
+ err = event__synthesize_modules(process_synthesized_event,
+ session,
+ &session->guest_kmaps,
+ PERF_RECORD_MISC_GUEST_KERNEL);
+ if (err < 0) {
+ pr_err("Couldn't record guest kernel reference relocation symbol.\n");
+ return err;
+ }
+
+ /*
+ * We use _stext for guest kernel because guest kernel's /proc/kallsyms
+ * have no _text.
+ */
+ err = event__synthesize_kernel_mmap(process_synthesized_event,
+ session, symbol_conf.guest_kallsyms,
+ "guest.kernel.kallsyms",
+ session->guest_vmlinux_maps,
+ "_stext",
+ PERF_RECORD_MISC_GUEST_KERNEL);
+ if (err < 0) {
+ pr_err("Couldn't record guest kernel reference relocation symbol.\n");
+ return err;
+ }
+ }
+
if (!system_wide && profile_cpu == -1)
event__synthesize_thread(target_pid, process_synthesized_event,
session);
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-report.c linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin-report.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-report.c 2010-03-16 08:59:54.760470652 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin-report.c 2010-03-16 10:40:24.102800324 +0800
@@ -104,7 +104,7 @@ static int perf_session__add_hist_entry(
return -ENOMEM;

if (hit)
- he->count += data->period;
+ __perf_session__add_count(he, al, data->period);

if (symbol_conf.use_callchain) {
if (!hit)
@@ -428,6 +428,8 @@ static const struct option options[] = {
"sort by key(s): pid, comm, dso, symbol, parent"),
OPT_BOOLEAN('P', "full-paths", &symbol_conf.full_paths,
"Don't shorten the pathnames taking into account the cwd"),
+ OPT_BOOLEAN(0, "showcpuutilization", &symbol_conf.show_cpu_utilization,
+ "Show sample percentage for different cpu modes"),
OPT_STRING('p', "parent", &parent_pattern, "regex",
"regex filter to identify parent, see: '--sort parent'"),
OPT_BOOLEAN('x', "exclude-other", &symbol_conf.exclude_other,
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-top.c linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin-top.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-top.c 2010-03-16 08:59:54.760470652 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/builtin-top.c 2010-03-16 09:01:09.984084103 +0800
@@ -417,8 +417,9 @@ static double sym_weight(const struct sy
}

static long samples;
-static long userspace_samples;
+static long kernel_samples, userspace_samples;
static long exact_samples;
+static long guest_us_samples, guest_kernel_samples;
static const char CONSOLE_CLEAR[] = " [H [2J";

static void __list_insert_active_sym(struct sym_entry *syme)
@@ -458,7 +459,10 @@ static void print_sym_table(void)
int printed = 0, j;
int counter, snap = !display_weighted ? sym_counter : 0;
float samples_per_sec = samples/delay_secs;
- float ksamples_per_sec = (samples-userspace_samples)/delay_secs;
+ float ksamples_per_sec = kernel_samples/delay_secs;
+ float userspace_samples_per_sec = (userspace_samples)/delay_secs;
+ float guest_kernel_samples_per_sec = (guest_kernel_samples)/delay_secs;
+ float guest_us_samples_per_sec = (guest_us_samples)/delay_secs;
float esamples_percent = (100.0*exact_samples)/samples;
float sum_ksamples = 0.0;
struct sym_entry *syme, *n;
@@ -467,7 +471,8 @@ static void print_sym_table(void)
int sym_width = 0, dso_width = 0, max_dso_width;
const int win_width = winsize.ws_col - 1;

- samples = userspace_samples = exact_samples = 0;
+ samples = userspace_samples = kernel_samples = exact_samples = 0;
+ guest_kernel_samples = guest_us_samples = 0;

/* Sort the active symbols */
pthread_mutex_lock(&active_symbols_lock);
@@ -498,10 +503,21 @@ static void print_sym_table(void)
puts(CONSOLE_CLEAR);

printf("%-*.*s\n", win_width, win_width, graph_dotted_line);
- printf( " PerfTop:%8.0f irqs/sec kernel:%4.1f%% exact: %4.1f%% [",
- samples_per_sec,
- 100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)),
- esamples_percent);
+ if (!perf_guest) {
+ printf( " PerfTop:%8.0f irqs/sec kernel:%4.1f%% exact: %4.1f%% [",
+ samples_per_sec,
+ 100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)),
+ esamples_percent);
+ } else {
+ printf( " PerfTop:%8.0f irqs/sec kernel:%4.1f%% us:%4.1f%%"
+ " guest kernel:%4.1f%% guest us:%4.1f%% exact: %4.1f%% [",
+ samples_per_sec,
+ 100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)),
+ 100.0 - (100.0*((samples_per_sec-userspace_samples_per_sec)/samples_per_sec)),
+ 100.0 - (100.0*((samples_per_sec-guest_kernel_samples_per_sec)/samples_per_sec)),
+ 100.0 - (100.0*((samples_per_sec-guest_us_samples_per_sec)/samples_per_sec)),
+ esamples_percent);
+ }

if (nr_counters == 1 || !display_weighted) {
printf("%Ld", (u64)attrs[0].sample_period);
@@ -958,9 +974,20 @@ static void event__process_sample(const
return;
break;
case PERF_RECORD_MISC_KERNEL:
+ ++kernel_samples;
if (hide_kernel_symbols)
return;
break;
+ case PERF_RECORD_MISC_GUEST_KERNEL:
+ ++guest_kernel_samples;
+ break;
+ case PERF_RECORD_MISC_GUEST_USER:
+ ++guest_us_samples;
+ /*
+ * TODO: we don't process guest user from host side
+ * except simple counting
+ */
+ return;
default:
return;
}
diff -Nraup linux-2.6_tipmaster0315/tools/perf/Makefile linux-2.6_tipmaster0315_perfkvm/tools/perf/Makefile
--- linux-2.6_tipmaster0315/tools/perf/Makefile 2010-03-16 08:59:54.892460680 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/Makefile 2010-03-16 10:45:19.503860691 +0800
@@ -462,6 +462,7 @@ BUILTIN_OBJS += builtin-trace.o
BUILTIN_OBJS += builtin-probe.o
BUILTIN_OBJS += builtin-kmem.o
BUILTIN_OBJS += builtin-lock.o
+BUILTIN_OBJS += builtin-kvm.o

PERFLIBS = $(LIB_FILE)

diff -Nraup linux-2.6_tipmaster0315/tools/perf/perf.c linux-2.6_tipmaster0315_perfkvm/tools/perf/perf.c
--- linux-2.6_tipmaster0315/tools/perf/perf.c 2010-03-16 08:59:54.764469663 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/perf.c 2010-03-16 09:01:09.984084103 +0800
@@ -308,6 +308,7 @@ static void handle_internal_command(int
{ "probe", cmd_probe, 0 },
{ "kmem", cmd_kmem, 0 },
{ "lock", cmd_lock, 0 },
+ { "kvm", cmd_kvm, 0 },
};
unsigned int i;
static const char ext[] = STRIP_EXTENSION;
diff -Nraup linux-2.6_tipmaster0315/tools/perf/perf.h linux-2.6_tipmaster0315_perfkvm/tools/perf/perf.h
--- linux-2.6_tipmaster0315/tools/perf/perf.h 2010-03-16 08:59:54.896488489 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/perf.h 2010-03-16 09:01:10.000116335 +0800
@@ -133,4 +133,6 @@ struct ip_callchain {
u64 ips[0];
};

+extern int perf_host, perf_guest;
+
#endif
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/event.c linux-2.6_tipmaster0315_perfkvm/tools/perf/util/event.c
--- linux-2.6_tipmaster0315/tools/perf/util/event.c 2010-03-16 08:59:54.864459297 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/util/event.c 2010-03-16 09:45:19.660852164 +0800
@@ -112,7 +112,7 @@ static int event__synthesize_mmap_events
event_t ev = {
.header = {
.type = PERF_RECORD_MMAP,
- .misc = 0, /* Just like the kernel, see kernel/perf_event.c __perf_event_mmap */
+ .misc = PERF_RECORD_MISC_USER, /* Just like the kernel, see kernel/perf_event.c __perf_event_mmap */
},
};
int n;
@@ -158,11 +158,13 @@ static int event__synthesize_mmap_events
}

int event__synthesize_modules(event__handler_t process,
- struct perf_session *session)
+ struct perf_session *session,
+ struct map_groups *kmaps,
+ unsigned int misc)
{
struct rb_node *nd;

- for (nd = rb_first(&session->kmaps.maps[MAP__FUNCTION]);
+ for (nd = rb_first(&kmaps->maps[MAP__FUNCTION]);
nd; nd = rb_next(nd)) {
event_t ev;
size_t size;
@@ -173,7 +175,7 @@ int event__synthesize_modules(event__han

size = ALIGN(pos->dso->long_name_len + 1, sizeof(u64));
memset(&ev, 0, sizeof(ev));
- ev.mmap.header.misc = 1; /* kernel uses 0 for user space maps, see kernel/perf_event.c __perf_event_mmap */
+ ev.mmap.header.misc = misc; /* kernel uses 0 for user space maps, see kernel/perf_event.c __perf_event_mmap */
ev.mmap.header.type = PERF_RECORD_MMAP;
ev.mmap.header.size = (sizeof(ev.mmap) -
(sizeof(ev.mmap.filename) - size));
@@ -241,13 +243,17 @@ static int find_symbol_cb(void *arg, con

int event__synthesize_kernel_mmap(event__handler_t process,
struct perf_session *session,
- const char *symbol_name)
+ const char *kallsyms_name,
+ const char *mmap_name,
+ struct map **maps,
+ const char *symbol_name,
+ unsigned int misc)
{
size_t size;
event_t ev = {
.header = {
.type = PERF_RECORD_MMAP,
- .misc = 1, /* kernel uses 0 for user space maps, see kernel/perf_event.c __perf_event_mmap */
+ .misc = misc, /* kernel uses PERF_RECORD_MISC_USER for user space maps, see kernel/perf_event.c __perf_event_mmap */
},
};
/*
@@ -257,16 +263,16 @@ int event__synthesize_kernel_mmap(event_
*/
struct process_symbol_args args = { .name = symbol_name, };

- if (kallsyms__parse("/proc/kallsyms", &args, find_symbol_cb) <= 0)
+ if (kallsyms__parse(kallsyms_name, &args, find_symbol_cb) <= 0)
return -ENOENT;

size = snprintf(ev.mmap.filename, sizeof(ev.mmap.filename),
- "[kernel.kallsyms.%s]", symbol_name) + 1;
+ "[%s.%s]", mmap_name, symbol_name) + 1;
size = ALIGN(size, sizeof(u64));
ev.mmap.header.size = (sizeof(ev.mmap) - (sizeof(ev.mmap.filename) - size));
ev.mmap.pgoff = args.start;
- ev.mmap.start = session->vmlinux_maps[MAP__FUNCTION]->start;
- ev.mmap.len = session->vmlinux_maps[MAP__FUNCTION]->end - ev.mmap.start ;
+ ev.mmap.start = maps[MAP__FUNCTION]->start;
+ ev.mmap.len = maps[MAP__FUNCTION]->end - ev.mmap.start ;

return process(&ev, session);
}
@@ -320,19 +326,25 @@ int event__process_lost(event_t *self, s
return 0;
}

-int event__process_mmap(event_t *self, struct perf_session *session)
+static void event_set_kernel_mmap_len(struct map **maps, event_t *self)
{
- struct thread *thread;
- struct map *map;
+ maps[MAP__FUNCTION]->start = self->mmap.start;
+ maps[MAP__FUNCTION]->end = self->mmap.start + self->mmap.len;
+ /*
+ * Be a bit paranoid here, some perf.data file came with
+ * a zero sized synthesized MMAP event for the kernel.
+ */
+ if (maps[MAP__FUNCTION]->end == 0)
+ maps[MAP__FUNCTION]->end = ~0UL;
+}

- dump_printf(" %d/%d: [%#Lx(%#Lx) @ %#Lx]: %s\n",
- self->mmap.pid, self->mmap.tid, self->mmap.start,
- self->mmap.len, self->mmap.pgoff, self->mmap.filename);
+static int __event__process_mmap(event_t *self, struct perf_session *session)
+{
+ struct map *map;
+ static const char kmmap_prefix[] = "[kernel.kallsyms.";

- if (self->mmap.pid == 0) {
- static const char kmmap_prefix[] = "[kernel.kallsyms.";
+ if (self->mmap.filename[0] == '/') {

- if (self->mmap.filename[0] == '/') {
char short_module_name[1024];
char *name = strrchr(self->mmap.filename, '/'), *dot;

@@ -348,9 +360,10 @@ int event__process_mmap(event_t *self, s
"[%.*s]", (int)(dot - name), name);
strxfrchar(short_module_name, '-', '_');

- map = perf_session__new_module_map(session,
+ map = map_groups__new_module(&session->kmaps,
self->mmap.start,
- self->mmap.filename);
+ self->mmap.filename,
+ 0);
if (map == NULL)
goto out_problem;

@@ -373,22 +386,94 @@ int event__process_mmap(event_t *self, s
if (kernel == NULL)
goto out_problem;

- kernel->kernel = 1;
- if (__perf_session__create_kernel_maps(session, kernel) < 0)
+ kernel->kernel = DSO_TYPE_KERNEL;
+ if (__map_groups__create_kernel_maps(&session->kmaps,
+ session->vmlinux_maps, kernel) < 0)
goto out_problem;

- session->vmlinux_maps[MAP__FUNCTION]->start = self->mmap.start;
- session->vmlinux_maps[MAP__FUNCTION]->end = self->mmap.start + self->mmap.len;
- /*
- * Be a bit paranoid here, some perf.data file came with
- * a zero sized synthesized MMAP event for the kernel.
- */
- if (session->vmlinux_maps[MAP__FUNCTION]->end == 0)
- session->vmlinux_maps[MAP__FUNCTION]->end = ~0UL;
-
- perf_session__set_kallsyms_ref_reloc_sym(session, symbol_name,
- self->mmap.pgoff);
+ event_set_kernel_mmap_len(session->vmlinux_maps, self);
+ perf_session__set_kallsyms_ref_reloc_sym(session->vmlinux_maps,
+ symbol_name,
+ self->mmap.pgoff);
}
+ return 0;
+
+out_problem:
+ return -1;
+}
+
+static int __event__process_guest_mmap(event_t *self, struct perf_session *session)
+{
+ struct map *map;
+
+ static const char kmmap_prefix[] = "[guest.kernel.kallsyms.";
+
+ if (memcmp(self->mmap.filename, kmmap_prefix,
+ sizeof(kmmap_prefix) - 1) == 0) {
+ const char *symbol_name = (self->mmap.filename +
+ sizeof(kmmap_prefix) - 1);
+ /*
+ * Should be there already, from the build-id table in
+ * the header.
+ */
+ struct dso *kernel = __dsos__findnew(&dsos__guest_kernel,
+ "[guest.kernel.kallsyms]");
+ if (kernel == NULL)
+ goto out_problem;
+
+ kernel->kernel = DSO_TYPE_GUEST_KERNEL;
+ if (__map_groups__create_kernel_maps(&session->guest_kmaps,
+ session->guest_vmlinux_maps, kernel) < 0)
+ goto out_problem;
+
+ event_set_kernel_mmap_len(session->guest_vmlinux_maps, self);
+ perf_session__set_kallsyms_ref_reloc_sym(session->guest_vmlinux_maps,
+ symbol_name,
+ self->mmap.pgoff);
+ /*
+ * preload dso of guest kernel and modules
+ */
+ dso__load(kernel, session->guest_vmlinux_maps[MAP__FUNCTION], NULL);
+ } else if (self->mmap.filename[0] == '[') {
+ char *name;
+
+ map = map_groups__new_module(&session->guest_kmaps,
+ self->mmap.start,
+ self->mmap.filename,
+ 1);
+ if (map == NULL)
+ goto out_problem;
+ name = strdup(self->mmap.filename);
+ if (name == NULL)
+ goto out_problem;
+
+ map->dso->short_name = name;
+ map->end = map->start + self->mmap.len;
+ }
+
+ return 0;
+out_problem:
+ return -1;
+}
+
+int event__process_mmap(event_t *self, struct perf_session *session)
+{
+ struct thread *thread;
+ struct map *map;
+ u8 cpumode = self->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+ int ret;
+
+ dump_printf(" %d/%d: [%#Lx(%#Lx) @ %#Lx]: %s\n",
+ self->mmap.pid, self->mmap.tid, self->mmap.start,
+ self->mmap.len, self->mmap.pgoff, self->mmap.filename);
+
+ if (self->mmap.pid == 0) {
+ if (cpumode == PERF_RECORD_MISC_GUEST_KERNEL)
+ ret = __event__process_guest_mmap(self, session);
+ else
+ ret = __event__process_mmap(self, session);
+ if (ret < 0)
+ goto out_problem;
return 0;
}

@@ -441,15 +526,33 @@ void thread__find_addr_map(struct thread

al->thread = self;
al->addr = addr;
+ al->cpumode = cpumode;

- if (cpumode == PERF_RECORD_MISC_KERNEL) {
+ if (cpumode == PERF_RECORD_MISC_KERNEL && perf_host) {
al->level = 'k';
mg = &session->kmaps;
- } else if (cpumode == PERF_RECORD_MISC_USER)
+ } else if (cpumode == PERF_RECORD_MISC_USER && perf_host) {
al->level = '.';
- else {
- al->level = 'H';
+ } else if (cpumode == PERF_RECORD_MISC_GUEST_KERNEL && perf_guest) {
+ al->level = 'g';
+ mg = &session->guest_kmaps;
+ } else {
+ /* TODO: We don't support guest user space. Might support late */
+ if (cpumode == PERF_RECORD_MISC_GUEST_USER && perf_guest)
+ al->level = 'u';
+ else
+ al->level = 'H';
al->map = NULL;
+
+ if ((cpumode == PERF_RECORD_MISC_GUEST_USER ||
+ cpumode == PERF_RECORD_MISC_GUEST_KERNEL) &&
+ !perf_guest)
+ al->filtered = true;
+ if ((cpumode == PERF_RECORD_MISC_USER ||
+ cpumode == PERF_RECORD_MISC_KERNEL) &&
+ !perf_host)
+ al->filtered = true;
+
return;
}
try_again:
@@ -464,10 +567,18 @@ try_again:
* "[vdso]" dso, but for now lets use the old trick of looking
* in the whole kernel symbol list.
*/
- if ((long long)al->addr < 0 && mg != &session->kmaps) {
+ if ((long long)al->addr < 0 &&
+ mg != &session->kmaps &&
+ cpumode == PERF_RECORD_MISC_KERNEL) {
mg = &session->kmaps;
goto try_again;
}
+ if ((long long)al->addr < 0 &&
+ mg != &session->guest_kmaps &&
+ cpumode == PERF_RECORD_MISC_GUEST_KERNEL) {
+ mg = &session->guest_kmaps;
+ goto try_again;
+ }
} else
al->addr = al->map->map_ip(al->map, al->addr);
}
@@ -513,6 +624,7 @@ int event__preprocess_sample(const event

dump_printf(" ... thread: %s:%d\n", thread->comm, thread->pid);

+ al->filtered = false;
thread__find_addr_location(thread, session, cpumode, MAP__FUNCTION,
self->ip.ip, al, filter);
dump_printf(" ...... dso: %s\n",
@@ -536,7 +648,6 @@ int event__preprocess_sample(const event
!strlist__has_entry(symbol_conf.sym_list, al->sym->name))
goto out_filtered;

- al->filtered = false;
return 0;

out_filtered:
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/event.h linux-2.6_tipmaster0315_perfkvm/tools/perf/util/event.h
--- linux-2.6_tipmaster0315/tools/perf/util/event.h 2010-03-16 08:59:54.856460879 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/util/event.h 2010-03-16 09:01:10.000116335 +0800
@@ -119,10 +119,17 @@ int event__synthesize_thread(pid_t pid,
void event__synthesize_threads(event__handler_t process,
struct perf_session *session);
int event__synthesize_kernel_mmap(event__handler_t process,
- struct perf_session *session,
- const char *symbol_name);
+ struct perf_session *session,
+ const char *kallsyms_name,
+ const char *mmap_name,
+ struct map **maps,
+ const char *symbol_name,
+ unsigned int misc);
+
int event__synthesize_modules(event__handler_t process,
- struct perf_session *session);
+ struct perf_session *session,
+ struct map_groups *kmaps,
+ unsigned int misc);

int event__process_comm(event_t *self, struct perf_session *session);
int event__process_lost(event_t *self, struct perf_session *session);
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/hist.c linux-2.6_tipmaster0315_perfkvm/tools/perf/util/hist.c
--- linux-2.6_tipmaster0315/tools/perf/util/hist.c 2010-03-16 08:59:54.880462306 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/util/hist.c 2010-03-16 10:44:18.228997471 +0800
@@ -8,6 +8,30 @@ struct callchain_param callchain_param =
.min_percent = 0.5
};

+void __perf_session__add_count(struct hist_entry *he,
+ struct addr_location *al,
+ u64 count)
+{
+ he->count += count;
+
+ switch (al->cpumode) {
+ case PERF_RECORD_MISC_KERNEL:
+ he->count_sys += count;
+ break;
+ case PERF_RECORD_MISC_USER:
+ he->count_us += count;
+ break;
+ case PERF_RECORD_MISC_GUEST_KERNEL:
+ he->count_guest_sys += count;
+ break;
+ case PERF_RECORD_MISC_GUEST_USER:
+ he->count_guest_us += count;
+ break;
+ default:
+ break;
+ }
+}
+
/*
* histogram, sorted on item, collects counts
*/
@@ -26,7 +50,6 @@ struct hist_entry *__perf_session__add_h
.sym = al->sym,
.ip = al->addr,
.level = al->level,
- .count = count,
.parent = sym_parent,
};
int cmp;
@@ -48,6 +71,8 @@ struct hist_entry *__perf_session__add_h
p = &(*p)->rb_right;
}

+ __perf_session__add_count(&entry, al, count);
+
he = malloc(sizeof(*he));
if (!he)
return NULL;
@@ -462,7 +487,7 @@ size_t hist_entry__fprintf(struct hist_e
u64 session_total)
{
struct sort_entry *se;
- u64 count, total;
+ u64 count, total, count_sys, count_us, count_guest_sys, count_guest_us;
const char *sep = symbol_conf.field_sep;
size_t ret;

@@ -472,15 +497,35 @@ size_t hist_entry__fprintf(struct hist_e
if (pair_session) {
count = self->pair ? self->pair->count : 0;
total = pair_session->events_stats.total;
+ count_sys = self->pair ? self->pair->count_sys : 0;
+ count_us = self->pair ? self->pair->count_us : 0;
+ count_guest_sys = self->pair ? self->pair->count_guest_sys : 0;
+ count_guest_us = self->pair ? self->pair->count_guest_us : 0;
} else {
count = self->count;
total = session_total;
+ count_sys = self->count_sys;
+ count_us = self->count_us;
+ count_guest_sys = self->count_guest_sys;
+ count_guest_us = self->count_guest_us;
}

- if (total)
+ if (total) {
ret = percent_color_fprintf(fp, sep ? "%.2f" : " %6.2f%%",
(count * 100.0) / total);
- else
+ if (symbol_conf.show_cpu_utilization) {
+ ret += percent_color_fprintf(fp, sep ? "%.2f" : " %6.2f%%",
+ (count_sys * 100.0) / total);
+ ret += percent_color_fprintf(fp, sep ? "%.2f" : " %6.2f%%",
+ (count_us * 100.0) / total);
+ if (perf_guest) {
+ ret += percent_color_fprintf(fp, sep ? "%.2f" : " %6.2f%%",
+ (count_guest_sys * 100.0) / total);
+ ret += percent_color_fprintf(fp, sep ? "%.2f" : " %6.2f%%",
+ (count_guest_us * 100.0) / total);
+ }
+ }
+ } else
ret = fprintf(fp, sep ? "%lld" : "%12lld ", count);

if (symbol_conf.show_nr_samples) {
@@ -576,6 +621,20 @@ size_t perf_session__fprintf_hists(struc
fputs(" Samples ", fp);
}

+ if (symbol_conf.show_cpu_utilization) {
+ if (sep) {
+ ret += fprintf(fp, "%csys", *sep);
+ ret += fprintf(fp, "%cus", *sep);
+ ret += fprintf(fp, "%cguest sys", *sep);
+ ret += fprintf(fp, "%cguest us", *sep);
+ } else {
+ ret += fprintf(fp, " sys ");
+ ret += fprintf(fp, " us ");
+ ret += fprintf(fp, " guest sys ");
+ ret += fprintf(fp, " guest us ");
+ }
+ }
+
if (pair) {
if (sep)
ret += fprintf(fp, "%cDelta", *sep);
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/hist.h linux-2.6_tipmaster0315_perfkvm/tools/perf/util/hist.h
--- linux-2.6_tipmaster0315/tools/perf/util/hist.h 2010-03-16 08:59:54.868491838 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/util/hist.h 2010-03-16 10:11:24.744056043 +0800
@@ -12,6 +12,9 @@ struct addr_location;
struct symbol;
struct rb_root;

+void __perf_session__add_count(struct hist_entry *he,
+ struct addr_location *al,
+ u64 count);
struct hist_entry *__perf_session__add_hist_entry(struct rb_root *hists,
struct addr_location *al,
struct symbol *parent,
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/session.c linux-2.6_tipmaster0315_perfkvm/tools/perf/util/session.c
--- linux-2.6_tipmaster0315/tools/perf/util/session.c 2010-03-16 08:59:54.888458734 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/util/session.c 2010-03-16 09:01:10.000116335 +0800
@@ -54,7 +54,12 @@ out_close:

static inline int perf_session__create_kernel_maps(struct perf_session *self)
{
- return map_groups__create_kernel_maps(&self->kmaps, self->vmlinux_maps);
+ int ret;
+ ret = map_groups__create_kernel_maps(&self->kmaps, self->vmlinux_maps);
+ if (ret >= 0)
+ ret = map_groups__create_guest_kernel_maps(&self->guest_kmaps,
+ self->guest_vmlinux_maps);
+ return ret;
}

struct perf_session *perf_session__new(const char *filename, int mode, bool force)
@@ -77,6 +82,7 @@ struct perf_session *perf_session__new(c
self->cwdlen = 0;
self->unknown_events = 0;
map_groups__init(&self->kmaps);
+ map_groups__init(&self->guest_kmaps);

if (mode == O_RDONLY) {
if (perf_session__open(self, force) < 0)
@@ -356,7 +362,8 @@ int perf_header__read_build_ids(struct p
if (read(input, filename, len) != len)
goto out;

- if (bev.header.misc & PERF_RECORD_MISC_KERNEL)
+ if ((bev.header.misc & PERF_RECORD_MISC_CPUMODE_MASK)
+ == PERF_RECORD_MISC_KERNEL)
head = &dsos__kernel;

dso = __dsos__findnew(head, filename);
@@ -519,26 +526,33 @@ bool perf_session__has_traces(struct per
return true;
}

-int perf_session__set_kallsyms_ref_reloc_sym(struct perf_session *self,
+int perf_session__set_kallsyms_ref_reloc_sym(struct map ** maps,
const char *symbol_name,
u64 addr)
{
char *bracket;
enum map_type i;
+ struct ref_reloc_sym *ref;

- self->ref_reloc_sym.name = strdup(symbol_name);
- if (self->ref_reloc_sym.name == NULL)
+ ref = zalloc(sizeof(struct ref_reloc_sym));
+ if (ref == NULL)
return -ENOMEM;

- bracket = strchr(self->ref_reloc_sym.name, ']');
+ ref->name = strdup(symbol_name);
+ if (ref->name == NULL) {
+ free(ref);
+ return -ENOMEM;
+ }
+
+ bracket = strchr(ref->name, ']');
if (bracket)
*bracket = '\0';

- self->ref_reloc_sym.addr = addr;
+ ref->addr = addr;

for (i = 0; i < MAP__NR_TYPES; ++i) {
- struct kmap *kmap = map__kmap(self->vmlinux_maps[i]);
- kmap->ref_reloc_sym = &self->ref_reloc_sym;
+ struct kmap *kmap = map__kmap(maps[i]);
+ kmap->ref_reloc_sym = ref;
}

return 0;
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/session.h linux-2.6_tipmaster0315_perfkvm/tools/perf/util/session.h
--- linux-2.6_tipmaster0315/tools/perf/util/session.h 2010-03-16 08:59:54.768472278 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/util/session.h 2010-03-16 09:04:50.827525867 +0800
@@ -16,16 +16,17 @@ struct perf_session {
unsigned long size;
unsigned long mmap_window;
struct map_groups kmaps;
+ struct map_groups guest_kmaps;
struct rb_root threads;
struct thread *last_match;
struct map *vmlinux_maps[MAP__NR_TYPES];
+ struct map *guest_vmlinux_maps[MAP__NR_TYPES];
struct events_stats events_stats;
struct rb_root stats_by_id;
unsigned long event_total[PERF_RECORD_MAX];
unsigned long unknown_events;
struct rb_root hists;
u64 sample_type;
- struct ref_reloc_sym ref_reloc_sym;
int fd;
int cwdlen;
char *cwd;
@@ -67,26 +68,12 @@ bool perf_session__has_traces(struct per
int perf_header__read_build_ids(struct perf_header *self, int input,
u64 offset, u64 file_size);

-int perf_session__set_kallsyms_ref_reloc_sym(struct perf_session *self,
+int perf_session__set_kallsyms_ref_reloc_sym(struct map ** maps,
const char *symbol_name,
u64 addr);

void mem_bswap_64(void *src, int byte_size);

-static inline int __perf_session__create_kernel_maps(struct perf_session *self,
- struct dso *kernel)
-{
- return __map_groups__create_kernel_maps(&self->kmaps,
- self->vmlinux_maps, kernel);
-}
-
-static inline struct map *
- perf_session__new_module_map(struct perf_session *self,
- u64 start, const char *filename)
-{
- return map_groups__new_module(&self->kmaps, start, filename);
-}
-
#ifdef NO_NEWT_SUPPORT
static inline void perf_session__browse_hists(struct rb_root *hists __used,
u64 session_total __used,
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/sort.h linux-2.6_tipmaster0315_perfkvm/tools/perf/util/sort.h
--- linux-2.6_tipmaster0315/tools/perf/util/sort.h 2010-03-16 08:59:54.780505450 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/util/sort.h 2010-03-16 09:46:38.997734739 +0800
@@ -44,6 +44,10 @@ extern enum sort_type sort__first_dimens
struct hist_entry {
struct rb_node rb_node;
u64 count;
+ u64 count_sys;
+ u64 count_us;
+ u64 count_guest_sys;
+ u64 count_guest_us;
struct thread *thread;
struct map *map;
struct symbol *sym;
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/symbol.c linux-2.6_tipmaster0315_perfkvm/tools/perf/util/symbol.c
--- linux-2.6_tipmaster0315/tools/perf/util/symbol.c 2010-03-16 08:59:54.784503211 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/util/symbol.c 2010-03-16 10:47:03.587519946 +0800
@@ -22,6 +22,8 @@ static void dsos__add(struct list_head *
static struct map *map__new2(u64 start, struct dso *dso, enum map_type type);
static int dso__load_kernel_sym(struct dso *self, struct map *map,
symbol_filter_t filter);
+static int dso__load_guest_kernel_sym(struct dso *self, struct map *map,
+ symbol_filter_t filter);
static int vmlinux_path__nr_entries;
static char **vmlinux_path;

@@ -172,6 +174,7 @@ struct dso *dso__new(const char *name)
self->loaded = 0;
self->sorted_by_name = 0;
self->has_build_id = 0;
+ self->kernel = DSO_TYPE_USER;
}

return self;
@@ -388,12 +391,9 @@ int kallsyms__parse(const char *filename
char *symbol_name;

line_len = getline(&line, &n, file);
- if (line_len < 0)
+ if (line_len < 0 || !line)
break;

- if (!line)
- goto out_failure;
-
line[--line_len] = '\0'; /* \n */

len = hex2u64(line, &start);
@@ -445,6 +445,7 @@ static int map__process_kallsym_symbol(v
* map__split_kallsyms, when we have split the maps per module
*/
symbols__insert(root, sym);
+
return 0;
}

@@ -490,6 +491,15 @@ static int dso__split_kallsyms(struct ds
*module++ = '\0';

if (strcmp(curr_map->dso->short_name, module)) {
+ if (curr_map != map &&
+ self->kernel == DSO_TYPE_GUEST_KERNEL) {
+ /*
+ * We assume all symbols of a module are continuous in
+ * kallsyms, so curr_map points to a module and all its
+ * symbols are in its kmap. Mark it as loaded.
+ */
+ dso__set_loaded(curr_map->dso, curr_map->type);
+ }
curr_map = map_groups__find_by_name(kmaps, map->type, module);
if (curr_map == NULL) {
pr_debug("/proc/{kallsyms,modules} "
@@ -511,13 +521,19 @@ static int dso__split_kallsyms(struct ds
char dso_name[PATH_MAX];
struct dso *dso;

- snprintf(dso_name, sizeof(dso_name), "[kernel].%d",
- kernel_range++);
+ if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+ snprintf(dso_name, sizeof(dso_name), "[guest.kernel].%d",
+ kernel_range++);
+ else
+ snprintf(dso_name, sizeof(dso_name), "[kernel].%d",
+ kernel_range++);

dso = dso__new(dso_name);
if (dso == NULL)
return -1;

+ dso->kernel = self->kernel;
+
curr_map = map__new2(pos->start, dso, map->type);
if (curr_map == NULL) {
dso__delete(dso);
@@ -541,6 +557,10 @@ discard_symbol: rb_erase(&pos->rb_node,
}
}

+ if (curr_map != map &&
+ self->kernel == DSO_TYPE_GUEST_KERNEL)
+ dso__set_loaded(curr_map->dso, curr_map->type);
+
return count;
}

@@ -551,7 +571,10 @@ int dso__load_kallsyms(struct dso *self,
return -1;

symbols__fixup_end(&self->symbols[map->type]);
- self->origin = DSO__ORIG_KERNEL;
+ if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+ self->origin = DSO__ORIG_GUEST_KERNEL;
+ else
+ self->origin = DSO__ORIG_KERNEL;

return dso__split_kallsyms(self, map, filter);
}
@@ -939,7 +962,7 @@ static int dso__load_sym(struct dso *sel
nr_syms = shdr.sh_size / shdr.sh_entsize;

memset(&sym, 0, sizeof(sym));
- if (!self->kernel) {
+ if (self->kernel == DSO_TYPE_USER) {
self->adjust_symbols = (ehdr.e_type == ET_EXEC ||
elf_section_by_name(elf, &ehdr, &shdr,
".gnu.prelink_undo",
@@ -971,7 +994,7 @@ static int dso__load_sym(struct dso *sel

section_name = elf_sec__name(&shdr, secstrs);

- if (self->kernel || kmodule) {
+ if (self->kernel != DSO_TYPE_USER || kmodule) {
char dso_name[PATH_MAX];

if (strcmp(section_name,
@@ -997,6 +1020,7 @@ static int dso__load_sym(struct dso *sel
curr_dso = dso__new(dso_name);
if (curr_dso == NULL)
goto out_elf_end;
+ curr_dso->kernel = self->kernel;
curr_map = map__new2(start, curr_dso,
map->type);
if (curr_map == NULL) {
@@ -1007,7 +1031,10 @@ static int dso__load_sym(struct dso *sel
curr_map->unmap_ip = identity__map_ip;
curr_dso->origin = self->origin;
map_groups__insert(kmap->kmaps, curr_map);
- dsos__add(&dsos__kernel, curr_dso);
+ if (curr_dso->kernel == DSO_TYPE_GUEST_KERNEL)
+ dsos__add(&dsos__guest_kernel, curr_dso);
+ else
+ dsos__add(&dsos__kernel, curr_dso);
dso__set_loaded(curr_dso, map->type);
} else
curr_dso = curr_map->dso;
@@ -1228,6 +1255,8 @@ char dso__symtab_origin(const struct dso
[DSO__ORIG_BUILDID] = 'b',
[DSO__ORIG_DSO] = 'd',
[DSO__ORIG_KMODULE] = 'K',
+ [DSO__ORIG_GUEST_KERNEL] = 'g',
+ [DSO__ORIG_GUEST_KMODULE] = 'G',
};

if (self == NULL || self->origin == DSO__ORIG_NOT_FOUND)
@@ -1246,8 +1275,10 @@ int dso__load(struct dso *self, struct m

dso__set_loaded(self, map->type);

- if (self->kernel)
+ if (self->kernel == DSO_TYPE_KERNEL)
return dso__load_kernel_sym(self, map, filter);
+ else if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+ return dso__load_guest_kernel_sym(self, map, filter);

name = malloc(size);
if (!name)
@@ -1451,7 +1482,7 @@ static int map_groups__set_modules_path(
static struct map *map__new2(u64 start, struct dso *dso, enum map_type type)
{
struct map *self = zalloc(sizeof(*self) +
- (dso->kernel ? sizeof(struct kmap) : 0));
+ (dso->kernel != DSO_TYPE_USER ? sizeof(struct kmap) : 0));
if (self != NULL) {
/*
* ->end will be filled after we load all the symbols
@@ -1463,11 +1494,15 @@ static struct map *map__new2(u64 start,
}

struct map *map_groups__new_module(struct map_groups *self, u64 start,
- const char *filename)
+ const char *filename, int guest)
{
struct map *map;
- struct dso *dso = __dsos__findnew(&dsos__kernel, filename);
+ struct dso *dso;

+ if (!guest)
+ dso = __dsos__findnew(&dsos__kernel, filename);
+ else
+ dso = __dsos__findnew(&dsos__guest_kernel, filename);
if (dso == NULL)
return NULL;

@@ -1475,16 +1510,20 @@ struct map *map_groups__new_module(struc
if (map == NULL)
return NULL;

- dso->origin = DSO__ORIG_KMODULE;
+ if (guest)
+ dso->origin = DSO__ORIG_GUEST_KMODULE;
+ else
+ dso->origin = DSO__ORIG_KMODULE;
map_groups__insert(self, map);
return map;
}

-static int map_groups__create_modules(struct map_groups *self)
+static int __map_groups__create_modules(struct map_groups *self,
+ const char * filename, int guest)
{
char *line = NULL;
size_t n;
- FILE *file = fopen("/proc/modules", "r");
+ FILE *file = fopen(filename, "r");
struct map *map;

if (file == NULL)
@@ -1518,16 +1557,17 @@ static int map_groups__create_modules(st
*sep = '\0';

snprintf(name, sizeof(name), "[%s]", line);
- map = map_groups__new_module(self, start, name);
+ map = map_groups__new_module(self, start, name, guest);
if (map == NULL)
goto out_delete_line;
- dso__kernel_module_get_build_id(map->dso);
+ if (!guest)
+ dso__kernel_module_get_build_id(map->dso);
}

free(line);
fclose(file);

- return map_groups__set_modules_path(self);
+ return 0;

out_delete_line:
free(line);
@@ -1535,6 +1575,21 @@ out_failure:
return -1;
}

+static int map_groups__create_modules(struct map_groups *self)
+{
+ int ret;
+
+ ret = __map_groups__create_modules(self, "/proc/modules", 0);
+ if (ret >= 0)
+ ret = map_groups__set_modules_path(self);
+ return ret;
+}
+
+static int map_groups__create_guest_modules(struct map_groups *self)
+{
+ return __map_groups__create_modules(self, symbol_conf.guest_modules, 1);
+}
+
static int dso__load_vmlinux(struct dso *self, struct map *map,
const char *vmlinux, symbol_filter_t filter)
{
@@ -1694,8 +1749,44 @@ out_fixup:
return err;
}

+static int dso__load_guest_kernel_sym(struct dso *self, struct map *map,
+ symbol_filter_t filter)
+{
+ int err;
+ const char *kallsyms_filename;
+ /*
+ * if the user specified a vmlinux filename, use it and only
+ * it, reporting errors to the user if it cannot be used.
+ * Or use file guest_kallsyms inputted by user on commandline
+ */
+ if (symbol_conf.guest_vmlinux_name != NULL) {
+ err = dso__load_vmlinux(self, map,
+ symbol_conf.guest_vmlinux_name, filter);
+ goto out_try_fixup;
+ }
+
+ kallsyms_filename = symbol_conf.guest_kallsyms;
+ if (!kallsyms_filename)
+ return -1;
+ err = dso__load_kallsyms(self, kallsyms_filename, map, filter);
+ if (err > 0)
+ pr_debug("Using %s for symbols\n", kallsyms_filename);
+
+out_try_fixup:
+ if (err > 0) {
+ if (kallsyms_filename != NULL)
+ dso__set_long_name(self, strdup("[guest.kernel.kallsyms]"));
+ map__fixup_start(map);
+ map__fixup_end(map);
+ }
+
+ return err;
+}
+
LIST_HEAD(dsos__user);
LIST_HEAD(dsos__kernel);
+LIST_HEAD(dsos__guest_user);
+LIST_HEAD(dsos__guest_kernel);

static void dsos__add(struct list_head *head, struct dso *dso)
{
@@ -1742,6 +1833,8 @@ void dsos__fprintf(FILE *fp)
{
__dsos__fprintf(&dsos__kernel, fp);
__dsos__fprintf(&dsos__user, fp);
+ __dsos__fprintf(&dsos__guest_kernel, fp);
+ __dsos__fprintf(&dsos__guest_user, fp);
}

static size_t __dsos__fprintf_buildid(struct list_head *head, FILE *fp,
@@ -1771,7 +1864,19 @@ struct dso *dso__new_kernel(const char *

if (self != NULL) {
self->short_name = "[kernel]";
- self->kernel = 1;
+ self->kernel = DSO_TYPE_KERNEL;
+ }
+
+ return self;
+}
+
+struct dso *dso__new_guest_kernel(const char *name)
+{
+ struct dso *self = dso__new(name ?: "[guest.kernel.kallsyms]");
+
+ if (self != NULL) {
+ self->short_name = "[guest.kernel]";
+ self->kernel = DSO_TYPE_GUEST_KERNEL;
}

return self;
@@ -1796,6 +1901,15 @@ static struct dso *dsos__create_kernel(c
return kernel;
}

+static struct dso *dsos__create_guest_kernel(const char *vmlinux)
+{
+ struct dso *kernel = dso__new_guest_kernel(vmlinux);
+
+ if (kernel != NULL)
+ dsos__add(&dsos__guest_kernel, kernel);
+ return kernel;
+}
+
int __map_groups__create_kernel_maps(struct map_groups *self,
struct map *vmlinux_maps[MAP__NR_TYPES],
struct dso *kernel)
@@ -1955,3 +2069,24 @@ int map_groups__create_kernel_maps(struc
map_groups__fixup_end(self);
return 0;
}
+
+int map_groups__create_guest_kernel_maps(struct map_groups *self,
+ struct map *vmlinux_maps[MAP__NR_TYPES])
+{
+ struct dso *kernel = dsos__create_guest_kernel(symbol_conf.guest_vmlinux_name);
+
+ if (kernel == NULL)
+ return -1;
+
+ if (__map_groups__create_kernel_maps(self, vmlinux_maps, kernel) < 0)
+ return -1;
+
+ if (symbol_conf.use_modules && map_groups__create_guest_modules(self) < 0)
+ pr_debug("Problems creating module maps, continuing anyway...\n");
+ /*
+ * Now that we have all the maps created, just set the ->end of them:
+ */
+ map_groups__fixup_end(self);
+ return 0;
+}
+
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/symbol.h linux-2.6_tipmaster0315_perfkvm/tools/perf/util/symbol.h
--- linux-2.6_tipmaster0315/tools/perf/util/symbol.h 2010-03-16 08:59:54.880462306 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/util/symbol.h 2010-03-16 10:37:03.880361568 +0800
@@ -63,10 +63,14 @@ struct symbol_conf {
show_nr_samples,
use_callchain,
exclude_other,
- full_paths;
+ full_paths,
+ show_cpu_utilization;
const char *vmlinux_name,
*field_sep;
- char *dso_list_str,
+ const char *guest_vmlinux_name,
+ *guest_kallsyms,
+ *guest_modules;
+ char *dso_list_str,
*comm_list_str,
*sym_list_str,
*col_width_list_str;
@@ -95,6 +99,13 @@ struct addr_location {
u64 addr;
char level;
bool filtered;
+ unsigned int cpumode;
+};
+
+enum dso_kernel_type {
+ DSO_TYPE_USER = 0,
+ DSO_TYPE_KERNEL,
+ DSO_TYPE_GUEST_KERNEL
};

struct dso {
@@ -104,7 +115,7 @@ struct dso {
u8 adjust_symbols:1;
u8 slen_calculated:1;
u8 has_build_id:1;
- u8 kernel:1;
+ enum dso_kernel_type kernel;
u8 hit:1;
u8 annotate_warned:1;
unsigned char origin;
@@ -119,6 +130,7 @@ struct dso {

struct dso *dso__new(const char *name);
struct dso *dso__new_kernel(const char *name);
+struct dso *dso__new_guest_kernel(const char *name);
void dso__delete(struct dso *self);

bool dso__loaded(const struct dso *self, enum map_type type);
@@ -131,7 +143,7 @@ static inline void dso__set_loaded(struc

void dso__sort_by_name(struct dso *self, enum map_type type);

-extern struct list_head dsos__user, dsos__kernel;
+extern struct list_head dsos__user, dsos__kernel, dsos__guest_user, dsos__guest_kernel;

struct dso *__dsos__findnew(struct list_head *head, const char *name);

@@ -160,6 +172,8 @@ enum dso_origin {
DSO__ORIG_BUILDID,
DSO__ORIG_DSO,
DSO__ORIG_KMODULE,
+ DSO__ORIG_GUEST_KERNEL,
+ DSO__ORIG_GUEST_KMODULE,
DSO__ORIG_NOT_FOUND,
};

diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/thread.h linux-2.6_tipmaster0315_perfkvm/tools/perf/util/thread.h
--- linux-2.6_tipmaster0315/tools/perf/util/thread.h 2010-03-16 08:59:54.764469663 +0800
+++ linux-2.6_tipmaster0315_perfkvm/tools/perf/util/thread.h 2010-03-16 09:01:10.004081483 +0800
@@ -82,6 +82,9 @@ int __map_groups__create_kernel_maps(str
int map_groups__create_kernel_maps(struct map_groups *self,
struct map *vmlinux_maps[MAP__NR_TYPES]);

+int map_groups__create_guest_kernel_maps(struct map_groups *self,
+ struct map *vmlinux_maps[MAP__NR_TYPES]);
+
struct map *map_groups__new_module(struct map_groups *self, u64 start,
- const char *filename);
+ const char *filename, int guest);
#endif /* __PERF_THREAD_H */



Avi Kivity

Mar 16, 2010, 1:50:02 AM
On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
> From: Zhang, Yanmin<yanmin...@linux.intel.com>
>
> Based on the discussion in KVM community, I worked out the patch to support
> perf to collect guest os statistics from host side. This patch is implemented
> with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
> critical bug and provided good suggestions with other guys. I really appreciate
> their kind help.
>
> The patch adds new subcommand kvm to perf.
>
> perf kvm top
> perf kvm record
> perf kvm report
> perf kvm diff
>
> The new perf could profile guest os kernel except guest os user space, but it
> could summarize guest os user space utilization per guest os.
>
> Below are some examples.
> 1) perf kvm top
> [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
> --guestmodules=/home/ymzhang/guest/modules top
>
>

Excellent, support for guest kernel != host kernel is critical (I can't
remember the last time I ran the same kernels).

How would we support multiple guests with different kernels? Perhaps a
symbol server that perf can connect to (and that would connect to guests
in turn)?

> diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
> --- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c 2010-03-16 08:59:11.825295404 +0800
> +++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c 2010-03-16 09:01:09.976084492 +0800
> @@ -26,6 +26,7 @@
> #include<linux/sched.h>
> #include<linux/moduleparam.h>
> #include<linux/ftrace_event.h>
> +#include<linux/perf_event.h>
> #include "kvm_cache_regs.h"
> #include "x86.h"
>
> @@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
> vmcs_write32(TPR_THRESHOLD, irr);
> }
>
> +DEFINE_PER_CPU(int, kvm_in_guest) = {0};
> +
> +static void kvm_set_in_guest(void)
> +{
> + percpu_write(kvm_in_guest, 1);
> +}
> +
> +static int kvm_is_in_guest(void)
> +{
> + return percpu_read(kvm_in_guest);
> +}
>

There is already PF_VCPU for this.
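
( As a minimal sketch - assuming the sample is handled on the same cpu that
runs the vcpu thread - the callback could just test the flag that
kvm_guest_enter()/kvm_guest_exit() already set and clear on current:

    static int kvm_is_in_guest(void)
    {
            /* PF_VCPU is set by kvm_guest_enter(), cleared by kvm_guest_exit() */
            return current->flags & PF_VCPU;
    }
)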

> +static struct perf_guest_info_callbacks kvm_guest_cbs = {
> + .is_in_guest = kvm_is_in_guest,
> + .is_user_mode = kvm_is_user_mode,
> + .get_guest_ip = kvm_get_guest_ip,
> + .reset_in_guest = kvm_reset_in_guest,
> +};
>

Should be in common code, not vmx specific.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

Ingo Molnar

Mar 16, 2010, 3:30:01 AM

* Avi Kivity <a...@redhat.com> wrote:

> On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
> >From: Zhang, Yanmin<yanmin...@linux.intel.com>
> >
> >Based on the discussion in KVM community, I worked out the patch to support
> >perf to collect guest os statistics from host side. This patch is implemented
> >with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
> >critical bug and provided good suggestions with other guys. I really appreciate
> >their kind help.
> >
> >The patch adds new subcommand kvm to perf.
> >
> > perf kvm top
> > perf kvm record
> > perf kvm report
> > perf kvm diff
> >
> >The new perf could profile guest os kernel except guest os user space, but it
> >could summarize guest os user space utilization per guest os.
> >
> >Below are some examples.
> >1) perf kvm top
> >[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
> >--guestmodules=/home/ymzhang/guest/modules top
> >
>
> Excellent, support for guest kernel != host kernel is critical (I
> can't remember the last time I ran same kernels).
>
> How would we support multiple guests with different kernels? Perhaps a
> symbol server that perf can connect to (and that would connect to guests in
> turn)?

The highest quality solution would be if KVM offered a 'guest extension' to
the guest kernel's /proc/kallsyms that made it easy for user-space to get this
information from an authoritative source.

That's the main reason why the host side /proc/kallsyms is so popular and so
useful: while in theory it's mostly redundant information which can be gleaned
from the System.map and other sources of symbol information, it's easily
available and can _always_ be trusted to come from the host kernel.

Separate System.map's have a tendency to go out of sync (or go missing when a
devel kernel gets rebuilt, or if a devel package is not installed), and server
ports (be that a TCP port space server or a UDP port space mount-point) are
both a configuration hassle and are not guest-transparent.

So for instrumentation infrastructure (such as perf) we have a large and
well-founded preference for intrinsic, built-in, kernel-provided information: i.e.
a largely 'built-in' and transparent mechanism to get to guest symbols.
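
( The format itself is trivial for tooling to consume - one
'address type name [module]' line per symbol. A sketch of the parsing side,
roughly what perf already does for the host's /proc/kallsyms:

    unsigned long long addr;
    char type, sym[128];

    if (sscanf(line, "%llx %c %127s", &addr, &type, sym) == 3)
            /* insert (addr, sym) into the guest dso's symbol table */;
)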

Thanks,

Ingo

Zhang, Yanmin

Mar 16, 2010, 3:50:03 AM
On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote:
> On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
> > From: Zhang, Yanmin<yanmin...@linux.intel.com>
> >
> > Based on the discussion in KVM community, I worked out the patch to support
> > perf to collect guest os statistics from host side. This patch is implemented
> > with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
> > critical bug and provided good suggestions with other guys. I really appreciate
> > their kind help.
> >
> > The patch adds new subcommand kvm to perf.
> >
> > perf kvm top
> > perf kvm record
> > perf kvm report
> > perf kvm diff
> >
> > The new perf could profile guest os kernel except guest os user space, but it
> > could summarize guest os user space utilization per guest os.
> >
> > Below are some examples.
> > 1) perf kvm top
> > [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
> > --guestmodules=/home/ymzhang/guest/modules top
> >
> >
>
Thanks for your kind comments.

> Excellent, support for guest kernel != host kernel is critical (I can't
> remember the last time I ran same kernels).
>
> How would we support multiple guests with different kernels?

With the patch, 'perf kvm report --sort pid' could show
summary statistics for all guest os instances. Then, use
the --pid parameter of 'perf kvm record' to collect data from a single problematic instance.

Right, but there is a window between kvm_guest_enter and really running
in the guest os, where a perf event might overflow. Anyway, the window is very
narrow; I will change it to use the PF_VCPU flag.

>
> > +static struct perf_guest_info_callbacks kvm_guest_cbs = {
> > + .is_in_guest = kvm_is_in_guest,
> > + .is_user_mode = kvm_is_user_mode,
> > + .get_guest_ip = kvm_get_guest_ip,
> > + .reset_in_guest = kvm_reset_in_guest,
> > +};
> >
>
> Should be in common code, not vmx specific.

Right. I discussed this with Yangsheng. I will move the above data structures and
callbacks to arch/x86/kvm/x86.c, and add get_ip, a new callback, to
kvm_x86_ops.
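
On the vmx side the new hook could be as small as (a sketch of the proposed
direction only, not existing code):

    static unsigned long vmx_get_ip(struct kvm_vcpu *vcpu)
    {
            return kvm_rip_read(vcpu);
    }

plus a '.get_ip = vmx_get_ip,' entry in vmx_x86_ops, and an equivalent
implementation on the svm side.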

Yanmin

Avi Kivity

Mar 16, 2010, 5:30:03 AM

The symbol server's client can certainly access the bits through vmchannel.

--
error compiling committee.c: too many arguments to function

Zhang, Yanmin

Mar 16, 2010, 5:30:03 AM
Sorry. I found that --pid currently selects not a process but a single thread (the main thread).

Ingo,

Is it possible to support a new parameter or extend --inherit, so 'perf record' and
'perf top' could collect data on all threads of a process when the process is running?

If not, I need to add a new ugly parameter, similar to --pid, to filter out process
data in userspace.

Avi Kivity

Mar 16, 2010, 5:40:01 AM
On 03/16/2010 11:28 AM, Zhang, Yanmin wrote:
> Sorry. I found currently --pid isn't process but a thread (main thread).
>
> Ingo,
>
> Is it possible to support a new parameter or extend --inherit, so 'perf record' and
> 'perf top' could collect data on all threads of a process when the process is running?
>

That seems like a worthwhile addition regardless of this thread.
Profile all current threads and any new ones. It probably makes sense
to call this --pid and rename the existing --pid to --thread.

--
error compiling committee.c: too many arguments to function


Avi Kivity

Mar 16, 2010, 5:40:02 AM
On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:
>
>> Excellent, support for guest kernel != host kernel is critical (I can't
>> remember the last time I ran same kernels).
>>
>> How would we support multiple guests with different kernels?
>>
> With the patch, 'perf kvm report --sort pid" could show
> summary statistics for all guest os instances. Then, use
> parameter --pid of 'perf kvm record' to collect single problematic instance data.
>

That certainly works, though automatic association of guest data with
guest symbols is friendlier.

There is also a window between setting the flag and calling 'int $2'
where an NMI might happen and be accounted incorrectly.

Perhaps separate the 'int $2' into a direct call into perf and another
call for the rest of NMI handling. I don't see how it would work on svm
though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs
will be disabled until the next IRET so it isn't racy, just tricky.
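
Roughly (hypothetical names, just to make the ordering concrete):

    kvm_before_handle_nmi(vcpu);    /* publish the vcpu for perf */
    asm("int $2");                  /* deliver the guest-induced NMI */
    kvm_after_handle_nmi(vcpu);     /* unpublish */

so the perf NMI handler can attribute the sample to the guest without
depending on a flag that was set much earlier on the entry path.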

>>> +static struct perf_guest_info_callbacks kvm_guest_cbs = {
>>> + .is_in_guest = kvm_is_in_guest,
>>> + .is_user_mode = kvm_is_user_mode,
>>> + .get_guest_ip = kvm_get_guest_ip,
>>> + .reset_in_guest = kvm_reset_in_guest,
>>> +};
>>>
>>>
>> Should be in common code, not vmx specific.
>>
> Right. I discussed with Yangsheng. I will move above data structures and
> callbacks to file arch/x86/kvm/x86.c, and add get_ip, a new callback to
> kvm_x86_ops.
>

You will need access to the vcpu pointer (kvm_rip_read() needs it), you
can put it in a percpu variable. I guess if it's not null, you know
you're in a guest, so no need for PF_VCPU.
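
Something like this, reusing the percpu primitives the patch already uses
(an untested sketch of the suggestion):

    static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);

    static int kvm_is_in_guest(void)
    {
            return percpu_read(current_vcpu) != NULL;
    }

    static unsigned long kvm_get_guest_ip(void)
    {
            /* only called after kvm_is_in_guest() returned true */
            return kvm_rip_read(percpu_read(current_vcpu));
    }

with percpu_write(current_vcpu, vcpu) on entry to the guest-NMI window and
percpu_write(current_vcpu, NULL) when leaving it.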

--
error compiling committee.c: too many arguments to function


Ingo Molnar

Mar 16, 2010, 5:50:03 AM

Yeah. For maximum utility I'd suggest extending --pid to include this, and
introducing --tid for the previous, limited-to-a-single-task functionality.

Most users would expect --pid to work like a 'late attach' - i.e. to work like
strace -f or like a gdb attach.

Ingo

Ingo Molnar

Mar 16, 2010, 6:00:02 AM

Ok, that would work, I suspect.

Would be nice to have the symbol server in tools/perf/ and also make it easy
to add it to the initrd via a .config switch or so.

That would have basically all of the advantages of being built into the kernel
(availability, configurability, transparency, hackability), while having all
the advantages of a user-space approach as well (flexibility, extensibility,
robustness, ease of maintenance, etc.).

If only we had tools/xorg/ integrated via the initrd that way ;-)

Thanks,

Ingo

Avi Kivity

Mar 16, 2010, 6:20:02 AM

Note, I am not advocating building the vmchannel client into the host
kernel. While that makes everything simpler for the user, it increases
the kernel footprint with all the disadvantages that come with that (any
bug is converted into a host DoS or worse).

So, perf would connect to qemu via (say) a well-known unix domain
socket, which would then talk to the guest kernel.
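
A minimal sketch of the host-side client (the socket path is purely
illustrative, no such convention exists yet):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    static int guest_symbol_fd(const char *guest)
    {
            struct sockaddr_un sa = { .sun_family = AF_UNIX };
            int fd = socket(AF_UNIX, SOCK_STREAM, 0);

            if (fd < 0)
                    return -1;
            snprintf(sa.sun_path, sizeof(sa.sun_path),
                     "/var/run/qemu/%s.symbols", guest);
            if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
                    close(fd);
                    return -1;
            }
            return fd;      /* read kallsyms-style data from it */
    }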

I know you won't like it, we'll continue to disagree on this unfortunately.

--
error compiling committee.c: too many arguments to function


Ingo Molnar

Mar 16, 2010, 6:30:02 AM

> kernel. [...]

Neither am i. What i suggested was a user-space binary/executable built in
tools/perf and put into the initrd.

That approach has the advantages i listed above, without having the
disadvantages of in-kernel code you listed.

Thanks,

Ingo

Avi Kivity

Mar 16, 2010, 6:50:02 AM
On 03/16/2010 12:20 PM, Ingo Molnar wrote:
>>>>
>>>> The symbol server's client can certainly access the bits through vmchannel.
>>>>
>>> Ok, that would work i suspect.
>>>
>>> Would be nice to have the symbol server in tools/perf/ and also make it easy
>>> to add it to the initrd via a .config switch or so.
>>>
>>> That would have basically all of the advantages of being built into the kernel
>>> (availability, configurability, transparency, hackability), while having all
>>> the advantages of a user-space approach as well (flexibility, extensibility,
>>> robustness, ease of maintenance, etc.).
>>>
>> Note, I am not advocating building the vmchannel client into the host
>> kernel. [...]
>>
> Neither am i. What i suggested was a user-space binary/executable built in
> tools/perf and put into the initrd.
>

I'm confused - initrd seems to be guest-side. I was talking about the
host side.

For the guest, placing the symbol server in tools/ is reasonable.

--
error compiling committee.c: too many arguments to function


Ingo Molnar

Mar 16, 2010, 7:00:02 AM

* Avi Kivity <a...@redhat.com> wrote:

> On 03/16/2010 12:20 PM, Ingo Molnar wrote:
> >>>>
> >>>>The symbol server's client can certainly access the bits through vmchannel.
> >>>Ok, that would work i suspect.
> >>>
> >>>Would be nice to have the symbol server in tools/perf/ and also make it easy
> >>>to add it to the initrd via a .config switch or so.
> >>>
> >>>That would have basically all of the advantages of being built into the kernel
> >>>(availability, configurability, transparency, hackability), while having all
> >>>the advantages of a user-space approach as well (flexibility, extensibility,
> >>>robustness, ease of maintenance, etc.).
> >>Note, I am not advocating building the vmchannel client into the host
> >>kernel. [...]
> >Neither am i. What i suggested was a user-space binary/executable built in
> >tools/perf and put into the initrd.
>
> I'm confused - initrd seems to be guest-side. I was talking about the host
> side.

The host side doesn't need much support - just some client capability in perf
itself. I suspect vmchannels are sufficiently flexible and configuration-free
for such purposes? (i.e. like a filesystem in essence)

Ingo

Avi Kivity

Mar 16, 2010, 7:20:02 AM
On 03/16/2010 12:50 PM, Ingo Molnar wrote:
>
>> I'm confused - initrd seems to be guest-side. I was talking about the host
>> side.
>>
> host side doesnt need much support - just some client capability in perf
> itself. I suspect vmchannels are sufficiently flexible and configuration-free
> for such purposes? (i.e. like a filesystem in essence)
>

I haven't followed vmchannel closely, but I think it is. vmchannel is
terminated in qemu on the host side, not in the host kernel. So perf
would need to connect to qemu.

--
error compiling committee.c: too many arguments to function


Ingo Molnar

Mar 16, 2010, 7:30:01 AM

* Avi Kivity <a...@redhat.com> wrote:

> On 03/16/2010 12:50 PM, Ingo Molnar wrote:
> >
> >>I'm confused - initrd seems to be guest-side. I was talking about the host
> >>side.
> >host side doesnt need much support - just some client capability in perf
> >itself. I suspect vmchannels are sufficiently flexible and configuration-free
> >for such purposes? (i.e. like a filesystem in essence)
>
> I haven't followed vmchannel closely, but I think it is. vmchannel is
> terminated in qemu on the host side, not in the host kernel. So perf would
> need to connect to qemu.

Hm, that sounds rather messy if we want to use it to basically expose kernel
functionality in a guest/host unified way. Is the qemu process discoverable in
some secure way? Can we trust it? Is there some proper tooling available to do
it, or do we have to push it through 2-3 packages to get such a useful feature
done?

( That is the general thought process by which many cross-discipline useful
desktop/server features hit the bit bucket before having had any chance of
being vetted by users, and why Linux sucks so much when it comes to feature
integration and application usability. )

Ingo

Avi Kivity

Mar 16, 2010, 8:30:03 AM
On 03/16/2010 01:25 PM, Ingo Molnar wrote:
>
>> I haven't followed vmchannel closely, but I think it is. vmchannel is
>> terminated in qemu on the host side, not in the host kernel. So perf would
>> need to connect to qemu.
>>
> Hm, that sounds rather messy if we want to use it to basically expose kernel
> functionality in a guest/host unified way. Is the qemu process discoverable in
> some secure way?

We know its pid.

> Can we trust it?

No choice, it contains the guest address space.

> Is there some proper tooling available to do
> it, or do we have to push it through 2-3 packages to get such a useful feature
> done?
>

libvirt manages qemu processes, but I don't think this should go through
libvirt. qemu can do this directly by opening a unix domain socket in a
well-known place.

> ( That is the general thought process how many cross-discipline useful
> desktop/server features hit the bit bucket before having had any chance of
> being vetted by users, and why Linux sucks so much when it comes to feature
> integration and application usability. )
>

You can't solve everything in the kernel, even with a well populated tools/.

--
error compiling committee.c: too many arguments to function


Ingo Molnar

Mar 16, 2010, 8:40:02 AM

* Avi Kivity <a...@redhat.com> wrote:

> On 03/16/2010 01:25 PM, Ingo Molnar wrote:
> >
> >>I haven't followed vmchannel closely, but I think it is. vmchannel is
> >>terminated in qemu on the host side, not in the host kernel. So perf would
> >>need to connect to qemu.
> >Hm, that sounds rather messy if we want to use it to basically expose kernel
> >functionality in a guest/host unified way. Is the qemu process discoverable in
> >some secure way?
>
> We know its pid.

How do i get a list of all 'guest instance PIDs', and what is the way to talk
to Qemu?

> > Can we trust it?
>
> No choice, it contains the guest address space.

I mean, i can trust a kernel service and i can trust /proc/kallsyms.

Can perf trust a random process claiming to be Qemu? What's the trust
mechanism here?

> > Is there some proper tooling available to do it, or do we have to push it
> > through 2-3 packages to get such a useful feature done?
>
> libvirt manages qemu processes, but I don't think this should go through
> libvirt. qemu can do this directly by opening a unix domain socket in a
> well-known place.

So Qemu has never run into such problems before?

( Sounds weird - i think Qemu configuration itself should be done via a
unix domain socket driven configuration protocol as well. )

> >( That is the general thought process how many cross-discipline useful
> > desktop/server features hit the bit bucket before having had any chance of
> > being vetted by users, and why Linux sucks so much when it comes to feature
> > integration and application usability. )
>
> You can't solve everything in the kernel, even with a well populated tools/.

Certainly not, but this is a technical problem in the kernel's domain, so it's
a fair (and natural) expectation to be able to solve this within the kernel
project.

Ingo

Avi Kivity

Mar 16, 2010, 8:50:02 AM
On 03/16/2010 02:29 PM, Ingo Molnar wrote:
> * Avi Kivity<a...@redhat.com> wrote:
>
>
>> On 03/16/2010 01:25 PM, Ingo Molnar wrote:
>>
>>>
>>>> I haven't followed vmchannel closely, but I think it is. vmchannel is
>>>> terminated in qemu on the host side, not in the host kernel. So perf would
>>>> need to connect to qemu.
>>>>
>>> Hm, that sounds rather messy if we want to use it to basically expose kernel
>>> functionality in a guest/host unified way. Is the qemu process discoverable in
>>> some secure way?
>>>
>> We know its pid.
>>
> How do i get a list of all 'guest instance PIDs',

Libvirt manages all qemus, but this should be implemented independently
of libvirt.

> and what is the way to talk
> to Qemu?
>

In general qemu exposes communication channels (such as the monitor) as
tcp connections, unix-domain sockets, stdio, etc. It's very flexible.

>>> Can we trust it?
>>>
>> No choice, it contains the guest address space.
>>
> I mean, i can trust a kernel service and i can trust /proc/kallsyms.
>
> Can perf trust a random process claiming to be Qemu? What's the trust
> mechanism here?
>

Obviously you can't trust anything you get from a guest, no matter how
you get it.

How do you trust a userspace program's symbols? You don't. How do you
get them? They're in a well-known location.

>>> Is there some proper tooling available to do it, or do we have to push it
>>> through 2-3 packages to get such a useful feature done?
>>>
>> libvirt manages qemu processes, but I don't think this should go through
>> libvirt. qemu can do this directly by opening a unix domain socket in a
>> well-known place.
>>
> So Qemu has never run into such problems before?
>
> ( Sounds weird - i think Qemu configuration itself should be done via a
> unix domain socket driven configuration protocol as well. )
>

That's exactly what happens. You invoke qemu with -monitor
unix:blah,server (or -qmp for a machine-readable format) and have your
management application connect to that. You can redirect guest serial
ports, console, parallel port, etc. to unix-domain or tcp sockets.
vmchannel is an extension of that mechanism.


>>> ( That is the general thought process how many cross-discipline useful
>>> desktop/server features hit the bit bucket before having had any chance of
>>> being vetted by users, and why Linux sucks so much when it comes to feature
>>> integration and application usability. )
>>>
>> You can't solve everything in the kernel, even with a well populated tools/.
>>
> Certainly not, but this is a technical problem in the kernel's domain, so it's
> a fair (and natural) expectation to be able to solve this within the kernel
> project.
>

Someone writing perf-gui outside the kernel would have the same
problems, no?

--
error compiling committee.c: too many arguments to function


Ingo Molnar

Mar 16, 2010, 9:10:02 AM

* Avi Kivity <a...@redhat.com> wrote:

> On 03/16/2010 02:29 PM, Ingo Molnar wrote:

> > I mean, i can trust a kernel service and i can trust /proc/kallsyms.
> >
> > Can perf trust a random process claiming to be Qemu? What's the trust
> > mechanism here?
>
> Obviously you can't trust anything you get from a guest, no matter how you
> get it.

I'm not talking about the symbol strings and addresses, and the object
contents for allocation (or debuginfo). I'm talking about the basic protocol
of establishing which guest is which.

I.e. we really want users to be able to:

1) have it all working with a single guest, without having to specify 'which'
guest (qemu PID) to work with. That is the dominant usecase both for
developers and for a fair portion of testers.

2) Have some reasonable symbolic identification for guests. For example a
usable approach would be to have 'perf kvm list', which would list all
currently active guests:

$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7

And from that point on 'perf kvm -g OpenSuse record' would do the obvious
thing. Users will be able to just use the 'OpenSuse' symbolic name for
that guest, even if the guest got restarted and switched its main PID.

Any such facility needs trusted enumeration and a protocol where i can trust
that the information i got is authoritative. (I.e. 'OpenSuse' truly matches to
the OpenSuse session - not to some local user starting up a Qemu instance that
claims to be 'OpenSuse'.)

Is such a scheme possible/available? I suspect all the KVM configuration tools
(i haven't used them in some time - gui and command-line tools alike) use
similar methods to ease guest management?

Ingo

Avi Kivity

Mar 16, 2010, 9:20:02 AM
On 03/16/2010 03:08 PM, Ingo Molnar wrote:
>
>>> I mean, i can trust a kernel service and i can trust /proc/kallsyms.
>>>
>>> Can perf trust a random process claiming to be Qemu? What's the trust
>>> mechanism here?
>>>
>> Obviously you can't trust anything you get from a guest, no matter how you
>> get it.
>>
> I'm not talking about the symbol strings and addresses, and the object
> contents for allocation (or debuginfo). I'm talking about the basic protocol
> of establishing which guest is which.
>

There is none. So far, qemu has only dealt with managing its own
guest, and left all multi-guest management to higher levels up the
stack (like libvirt).

> I.e. we really want to be able users to:
>
> 1) have it all working with a single guest, without having to specify 'which'
> guest (qemu PID) to work with. That is the dominant usecase both for
> developers and for a fair portion of testers.
>

That's reasonable if we can get it working simply.

> 2) Have some reasonable symbolic identification for guests. For example a
> usable approach would be to have 'perf kvm list', which would list all
> currently active guests:
>
> $ perf kvm list
> [1] Fedora
> [2] OpenSuse
> [3] Windows-XP
> [4] Windows-7
>
> And from that point on 'perf kvm -g OpenSuse record' would do the obvious
> thing. Users will be able to just use the 'OpenSuse' symbolic name for
> that guest, even if the guest got restarted and switched its main PID.
>
> Any such facility needs trusted enumeration and a protocol where i can trust
> that the information i got is authorative. (I.e. 'OpenSuse' truly matches to
> the OpenSuse session - not to some local user starting up a Qemu instance that
> claims to be 'OpenSuse'.)
>
> Is such a scheme possible/available? I suspect all the KVM configuration tools
> (i havent used them in some time - gui and command-line tools alike) use
> similar methods to ease guest management?
>

You can do that through libvirt, but that only works for guests started
through libvirt. libvirt provides command-line tools to list and manage
guests (for example autostarting them on startup), and tools built on
top of libvirt can manage guests graphically.

Looks like we have a layer inversion here. Maybe we need a plugin
system - libvirt drops a .so into perf that teaches it how to list
guests and get their symbols.
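
The interface could be tiny - one struct of callbacks that perf dlopen()s
from a well-known directory (all names below are hypothetical):

    struct perf_guest_plugin_ops {
            int   (*list_guests)(char ***names, int *nr);
            pid_t (*guest_pid)(const char *name);     /* the qemu process */
            FILE *(*open_kallsyms)(const char *name); /* guest symbols */
    };

libvirt's plugin would implement these on top of its own guest registry.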

--
error compiling committee.c: too many arguments to function


Avi Kivity

Mar 16, 2010, 9:40:03 AM
On 03/16/2010 03:31 PM, Ingo Molnar wrote:
>
>> You can do that through libvirt, but that only works for guests started
>> through libvirt. libvirt provides command-line tools to list and manage
>> guests (for example autostarting them on startup), and tools built on top of
>> libvirt can manage guests graphically.
>>
>> Looks like we have a layer inversion here. Maybe we need a plugin system -
>> libvirt drops a .so into perf that teaches it how to list guests and get
>> their symbols.
>>
> Is libvirt used to start up all KVM guests? If not, if it's only used on some
> distros while other distros have other solutions then there's apparently no
> good way to get to such information, and the kernel bits of KVM do not provide
> it.
>

Developers tend to start qemu from the command line, but the majority of
users and all distros I know of use libvirt. Some users cobble up their
own scripts.

> To the user (and to me) this looks like a KVM bug / missing feature. (and the
> user doesnt care where the blame is) If that is true then apparently the
> current KVM design has no technically actionable solution for certain
> categories of features!
>

A plugin system allows anyone who is interested to provide the
information; they just need to write a plugin for their management tool.

Since we can't prevent people from writing management tools, I don't see
what else we can do.

Ingo Molnar

Mar 16, 2010, 9:40:02 AM

* Avi Kivity <a...@redhat.com> wrote:

> On 03/16/2010 03:08 PM, Ingo Molnar wrote:
> >
> >>>I mean, i can trust a kernel service and i can trust /proc/kallsyms.
> >>>
> >>>Can perf trust a random process claiming to be Qemu? What's the trust
> >>>mechanism here?
> >>Obviously you can't trust anything you get from a guest, no matter how you
> >>get it.
> >I'm not talking about the symbol strings and addresses, and the object
> >contents for allocation (or debuginfo). I'm talking about the basic protocol
> >of establishing which guest is which.
>
> There is none. So far, qemu only dealt with managing just its own
> guest, and left all multiple guest management to higher levels up
> the stack (like libvirt).
>
> >I.e. we really want to be able users to:
> >
> > 1) have it all working with a single guest, without having to specify 'which'
> > guest (qemu PID) to work with. That is the dominant usecase both for
> > developers and for a fair portion of testers.
>
> That's reasonable if we can get it working simply.

IMO such ease of use is reasonable and required, full stop.

If it cannot be gotten simply then that's a bug: either in the code, or in the
design, or in the development process that led to the design. Bugs need
fixing.

Is libvirt used to start up all KVM guests? If not, if it's only used on some
distros while other distros have other solutions then there's apparently no
good way to get to such information, and the kernel bits of KVM do not provide
it.

To the user (and to me) this looks like a KVM bug / missing feature (and the
user doesn't care where the blame is). If that is true then apparently the
current KVM design has no technically actionable solution for certain
categories of features!

Ingo

Frank Ch. Eigler

Mar 16, 2010, 11:10:02 AM

Ingo Molnar <mi...@elte.hu> writes:

> [...]


>> >I.e. we really want to be able users to:
>> >
>> > 1) have it all working with a single guest, without having to specify 'which'
>> > guest (qemu PID) to work with. That is the dominant usecase both for
>> > developers and for a fair portion of testers.
>>
>> That's reasonable if we can get it working simply.
>
> IMO such ease of use is reasonable and required, full stop.
> If it cannot be gotten simply then that's a bug: either in the code, or in the
> design, or in the development process that led to the design. Bugs need
> fixing. [...]

Perhaps the fact that kvm happens to deal with an interesting
application area (virtualization) is misleading here. As far as the
host kernel or other host userspace is concerned, qemu is just some
random unprivileged userspace program (with some *optional* /dev/kvm
services that might happen to require temporary root).

As such, perf trying to instrument qemu is no different than perf
trying to instrument any other userspace widget. Therefore, expecting
'trusted enumeration' of instances is just as sensible as using
'trusted ps' and 'trusted /var/run/FOO.pid files'.


- FChE

Ingo Molnar

Mar 16, 2010, 12:00:02 PM

* Frank Ch. Eigler <fc...@redhat.com> wrote:

>
> Ingo Molnar <mi...@elte.hu> writes:
>
> > [...]
> >> >I.e. we really want to be able users to:
> >> >
> >> > 1) have it all working with a single guest, without having to specify 'which'
> >> > guest (qemu PID) to work with. That is the dominant usecase both for
> >> > developers and for a fair portion of testers.
> >>
> >> That's reasonable if we can get it working simply.
> >
> > IMO such ease of use is reasonable and required, full stop.
> > If it cannot be gotten simply then that's a bug: either in the code, or in the
> > design, or in the development process that led to the design. Bugs need
> > fixing. [...]
>
> Perhaps the fact that kvm happens to deal with an interesting application
> area (virtualization) is misleading here. As far as the host kernel or
> other host userspace is concerned, qemu is just some random unprivileged
> userspace program (with some *optional* /dev/kvm services that might happen
> to require temporary root).
>
> As such, perf trying to instrument qemu is no different than perf trying to
> instrument any other userspace widget. Therefore, expecting 'trusted
> enumeration' of instances is just as sensible as using 'trusted ps' and
> 'trusted /var/run/FOO.pid files'.

You are quite mistaken: KVM isn't really a 'random unprivileged application' in
this context, it is clearly an extension of system/kernel services.

( Which can be seen from the simple fact that what started the discussion was
'how do we get /proc/kallsyms from the guest'. I.e. an extension of the
existing host-space /proc/kallsyms was desired. )

In that sense the most natural 'extension' would be the solution i mentioned a
week or two ago: to have a (read only) mount of all guest filesystems, plus a
channel for profiling/tracing data. That would make symbol parsing easier and
it's what extends the existing 'host space' abstraction in the most natural
way.

( It doesn't even have to be done via the kernel - Qemu could implement that
via FUSE for example. )

As a second-best option a 'symbol server' might be used too.

Thanks,

Ingo

Frank Ch. Eigler

Mar 16, 2010, 12:10:02 PM
Hi -

On Tue, Mar 16, 2010 at 04:52:21PM +0100, Ingo Molnar wrote:
> [...]
> > Perhaps the fact that kvm happens to deal with an interesting application
> > area (virtualization) is misleading here. As far as the host kernel or
> > other host userspace is concerned, qemu is just some random unprivileged
> > userspace program [...]

> You are quite mistaken: KVM isnt really a 'random unprivileged
> application' in this context, it is clearly an extension of
> system/kernel services.

I don't know what "extension of system/kernel services" means in this
context, beyond something running on the system/kernel, like every
other process. To clarify, to what extent do you consider your
classification similarly clear for a host is running

* multiple kvm instances run as unprivileged users
* non-kvm OS simulators such as vmware or xen or gdb
* kvm instances running something other than linux

> ( Which can be seen from the simple fact that what started the
> discussion was 'how do we get /proc/kallsyms from the
> guest'. I.e. an extension of the existing host-space /proc/kallsyms
> was desired. )

(Sorry, that smacks of circular reasoning.)

It may be a charming convenience for perf users to give them
shortcuts for certain favoured configurations (kvm running the freshest
linux), but that says more about perf than kvm.


- FChE

Ingo Molnar

Mar 16, 2010, 12:40:02 PM

* Frank Ch. Eigler <fc...@redhat.com> wrote:

> Hi -
>
> On Tue, Mar 16, 2010 at 04:52:21PM +0100, Ingo Molnar wrote:
> > [...]
> > > Perhaps the fact that kvm happens to deal with an interesting application
> > > area (virtualization) is misleading here. As far as the host kernel or
> > > other host userspace is concerned, qemu is just some random unprivileged
> > > userspace program [...]
>
> > You are quite mistaken: KVM isnt really a 'random unprivileged
> > application' in this context, it is clearly an extension of
> > system/kernel services.
>
> I don't know what "extension of system/kernel services" means in this
> context, beyond something running on the system/kernel, like every other
> process. [...]

It means something like my example of 'extended to guest space'
/proc/kallsyms:

> > [...]


> >
> > ( Which can be seen from the simple fact that what started the
> > discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an
> > extension of the existing host-space /proc/kallsyms was desired. )
>
> (Sorry, that smacks of circular reasoning.)

To me it sounds like an example supporting my point. /proc/kallsyms is a
service provided by the kernel, and 'perf kvm' desires this to be extended to guest
space as well.

Thanks,

Ingo

Anthony Liguori

Mar 16, 2010, 1:20:01 PM
On 03/16/2010 08:08 AM, Ingo Molnar wrote:
> * Avi Kivity<a...@redhat.com> wrote:
>
>
>> On 03/16/2010 02:29 PM, Ingo Molnar wrote:
>>
>
>>> I mean, i can trust a kernel service and i can trust /proc/kallsyms.
>>>
>>> Can perf trust a random process claiming to be Qemu? What's the trust
>>> mechanism here?
>>>
>> Obviously you can't trust anything you get from a guest, no matter how you
>> get it.
>>
> I'm not talking about the symbol strings and addresses, and the object
> contents for allocation (or debuginfo). I'm talking about the basic protocol
> of establishing which guest is which.
>
> I.e. we really want to be able users to:
>
> 1) have it all working with a single guest, without having to specify 'which'
> guest (qemu PID) to work with. That is the dominant usecase both for
> developers and for a fair portion of testers.
>

You're making too many assumptions.

There is no list of guests any more than there is a list of web browsers.

You can have a multi-tenant scenario where you have distinct groups of
virtual machines running as unprivileged users.

> 2) Have some reasonable symbolic identification for guests. For example a
> usable approach would be to have 'perf kvm list', which would list all
> currently active guests:
>
> $ perf kvm list
> [1] Fedora
> [2] OpenSuse
> [3] Windows-XP
> [4] Windows-7
>
> And from that point on 'perf kvm -g OpenSuse record' would do the obvious
> thing. Users will be able to just use the 'OpenSuse' symbolic name for
> that guest, even if the guest got restarted and switched its main PID.
>

Does "perf kvm list" always run as root? What if two unprivileged users
both have a VM named "Fedora"?

If we look at the use-case, it's going to be something like, a user is
creating virtual machines and wants to get performance information about
them.

Having to run a separate tool like perf is not what they would expect
to have to do. Instead, they would either use their
existing GUI tool (like virt-manager) or they would use their management
interface (either QMP or libvirt).

The complexity of interaction is due to the fact that perf shouldn't be
a standalone tool. It should be a library or something with a
programmatic interface that another tool can make use of.

Regards,

Anthony Liguori

> Is such a scheme possible/available? I suspect all the KVM configuration tools
> (i havent used them in some time - gui and command-line tools alike) use
> similar methods to ease guest management?
>
> Ingo

Anthony Liguori

Mar 16, 2010, 1:40:03 PM
On 03/16/2010 10:52 AM, Ingo Molnar wrote:
> You are quite mistaken: KVM isnt really a 'random unprivileged application' in
> this context, it is clearly an extension of system/kernel services.
>
> ( Which can be seen from the simple fact that what started the discussion was
> 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the
> existing host-space /proc/kallsyms was desired. )
>

Random tools (like perf) should not be able to do what you describe.
It's a security nightmare.

If it's desirable to have /proc/kallsyms available, we can expose an
interface in QEMU to provide that. That can then be plumbed through
libvirt and QMP.

Then a management tool can use libvirt or QMP to obtain that information
and interact with the kernel appropriately.

> In that sense the most natural 'extension' would be the solution i mentioned a
> week or two ago: to have a (read only) mount of all guest filesystems, plus a
> channel for profiling/tracing data. That would make symbol parsing easier and
> it's what extends the existing 'host space' abstraction in the most natural
> way.
>
> ( It doesnt even have to be done via the kernel - Qemu could implement that
> via FUSE for example. )
>

No way. The guest has sensitive data and exposing it widely on the host
is a bad thing to do. It's a bad interface. We can expose specific
information about guests but only through our existing channels which
are validated through a security infrastructure.

Ultimately, your goal is to keep perf a simple tool with few
dependencies. But practically speaking, if you want to add features to
it, it's going to have to interact with other subsystems in the
appropriate way. That means, it's going to need to interact with
libvirt or QMP.

If you want all applications to expose their data via synthetic file
systems, then there's always plan9 :-)

Regards,

Anthony Liguori

Ingo Molnar

Mar 16, 2010, 1:50:03 PM

* Anthony Liguori <ant...@codemonkey.ws> wrote:

> On 03/16/2010 08:08 AM, Ingo Molnar wrote:
> >* Avi Kivity<a...@redhat.com> wrote:
> >
> >>On 03/16/2010 02:29 PM, Ingo Molnar wrote:
> >>>I mean, i can trust a kernel service and i can trust /proc/kallsyms.
> >>>
> >>>Can perf trust a random process claiming to be Qemu? What's the trust
> >>>mechanism here?
> >>Obviously you can't trust anything you get from a guest, no matter how you
> >>get it.
> >I'm not talking about the symbol strings and addresses, and the object
> >contents for allocation (or debuginfo). I'm talking about the basic protocol
> >of establishing which guest is which.
> >
> >I.e. we really want to be able users to:
> >
> > 1) have it all working with a single guest, without having to specify 'which'
> > guest (qemu PID) to work with. That is the dominant usecase both for
> > developers and for a fair portion of testers.
>
> You're making too many assumptions.
>
> There is no list of guests anymore than there is a list of web browsers.
>
> You can have a multi-tenant scenario where you have distinct groups of
> virtual machines running as unprivileged users.

"multi-tenant" and groups is not a valid excuse at all for giving crappy
technology in the simplest case: when there's a single VM. Yes, eventually it
can be supported and any sane scheme will naturally support it too, but it's
by no means what we care about primarily when it comes to these tools.

I thought everyone learned the lesson behind SystemTap's failure (and to a
certain degree this was behind Oprofile's failure as well): when it comes to
tooling/instrumentation we don't want to concentrate on the fancy complex
setups and abstract requirements drawn up by CIOs, as development isn't being
done there. Concentrate on our developers today, and provide no-compromises
usability to those who contribute stuff.

If we don't help make the simplest (and most common) use-case convenient then
we are failing on a fundamental level.

> > 2) Have some reasonable symbolic identification for guests. For example a
> > usable approach would be to have 'perf kvm list', which would list all
> > currently active guests:
> >
> > $ perf kvm list
> > [1] Fedora
> > [2] OpenSuse
> > [3] Windows-XP
> > [4] Windows-7
> >
> > And from that point on 'perf kvm -g OpenSuse record' would do the obvious
> > thing. Users will be able to just use the 'OpenSuse' symbolic name for
> > that guest, even if the guest got restarted and switched its main PID.
>
> Does "perf kvm list" always run as root? What if two unprivileged users
> both have a VM named "Fedora"?

Again, the single-VM case is the most important case, by far. If you have
multiple VMs running and want to develop the kernel on multiple VMs (sounds
rather messy if you think it through ...), what would happen is similar to
what happens when we have two probes for example:

# perf probe schedule
Added new event:
probe:schedule (on schedule+0)

You can now use it on all perf tools, such as:

perf record -e probe:schedule -a sleep 1

# perf probe -f schedule
Added new event:
probe:schedule_1 (on schedule+0)

You can now use it on all perf tools, such as:

perf record -e probe:schedule_1 -a sleep 1

# perf probe -f schedule
Added new event:
probe:schedule_2 (on schedule+0)

You can now use it on all perf tools, such as:

perf record -e probe:schedule_2 -a sleep 1

Something similar could be used for KVM/Qemu: whichever got created first is
named 'Fedora', the second is named 'Fedora-2'.
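
In code, the disambiguation could be as trivial as (a toy sketch):

    if (nth == 0)
            snprintf(name, len, "%s", base);
    else
            snprintf(name, len, "%s-%d", base, nth + 1);

i.e. the first 'Fedora' keeps its plain name, later instances become
'Fedora-2', 'Fedora-3', and so on.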

> If we look at the use-case, it's going to be something like, a user is
> creating virtual machines and wants to get performance information about
> them.
>
> Having to run a separate tool like perf is not going to be what they would
> expect they had to do. Instead, they would either use their existing GUI
> tool (like virt-manager) or they would use their management interface
> (either QMP or libvirt).
>
> The complexity of interaction is due to the fact that perf shouldn't be a
> stand alone tool. It should be a library or something with a programmatic
> interface that another tool can make use of.

But ... a GUI interface/integration is of course possible too, and it's being
worked on.

perf is mainly a kernel developer tool, and kernel developers generally don't
use GUIs to do their stuff: which is the (sole) reason why its first ~850
commits of tools/perf/ were done without a GUI. We go where our developers
are.

In any case it's not an excuse to have no proper command-line tooling. In fact
if you cannot get simpler, more atomic command-line tooling right then you'll
probably doubly suck at doing a GUI as well.

Ingo

Ingo Molnar

Mar 16, 2010, 2:00:02 PM

* Anthony Liguori <alig...@linux.vnet.ibm.com> wrote:

> On 03/16/2010 10:52 AM, Ingo Molnar wrote:
> >You are quite mistaken: KVM isnt really a 'random unprivileged application' in
> >this context, it is clearly an extension of system/kernel services.
> >
> >( Which can be seen from the simple fact that what started the discussion was
> > 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the
> > existing host-space /proc/kallsyms was desired. )
>
> Random tools (like perf) should not be able to do what you describe. It's a
> security nightmare.

A security nightmare exactly how? Mind going into details? I don't understand
your point.

> If it's desirable to have /proc/kallsyms available, we can expose an
> interface in QEMU to provide that. That can then be plumbed through libvirt
> and QMP.
>
> Then a management tool can use libvirt or QMP to obtain that information and
> interact with the kernel appropriately.
>
> > In that sense the most natural 'extension' would be the solution i
> > mentioned a week or two ago: to have a (read only) mount of all guest
> > filesystems, plus a channel for profiling/tracing data. That would make
> > symbol parsing easier and it's what extends the existing 'host space'
> > abstraction in the most natural way.
> >
> > ( It doesnt even have to be done via the kernel - Qemu could implement that
> > via FUSE for example. )
>
> No way. The guest has sensitive data and exposing it widely on the host is

> a bad thing to do. [...]

Firstly, you are putting words into my mouth, as i said nothing about
'exposing it widely'. I suggest exposing it under the privileges of whoever
has access to the guest image.

Secondly, regarding confidentiality, and this is guest security 101: whoever
can access the image on the host _already_ has access to all the guest data!

A Linux image can generally be loopback mounted straight away:

losetup -o 32256 /dev/loop0 ./guest-image.img
mount -o ro /dev/loop0 /mnt-guest

(Or, if you are an unprivileged user who cannot mount, it can be read via ext2
tools.)

There's nothing the guest can do about that. The host is in total control of
guest image data for heaven's sake!

All i'm suggesting is to make what is already possible more convenient.

Ingo

Anthony Liguori

Mar 16, 2010, 2:10:02 PM
On 03/16/2010 12:52 PM, Ingo Molnar wrote:
> * Anthony Liguori<alig...@linux.vnet.ibm.com> wrote:
>
>
>> On 03/16/2010 10:52 AM, Ingo Molnar wrote:
>>
>>> You are quite mistaken: KVM isnt really a 'random unprivileged application' in
>>> this context, it is clearly an extension of system/kernel services.
>>>
>>> ( Which can be seen from the simple fact that what started the discussion was
>>> 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the
>>> existing host-space /proc/kallsyms was desired. )
>>>
>> Random tools (like perf) should not be able to do what you describe. It's a
>> security nightmare.
>>
> A security nightmare exactly how? Mind to go into details as i dont understand
> your point.
>

Assume you're using SELinux to implement mandatory access control. How
do you label this file system?

Generally speaking, we don't know the difference between /proc/kallsyms
vs. /dev/mem if we do generic passthrough. While it might be safe to
have a relaxed label of kallsyms (since it's read only), it's clearly
not safe to do that for /dev/mem, /etc/shadow, or any file containing
sensitive information.

Rather, we ought to expose a higher level interface that we have more
confidence in with respect to understanding the ramifications of
exposing that guest data.

>>
>> No way. The guest has sensitive data and exposing it widely on the host is
>> a bad thing to do. [...]
>>
> Firstly, you are putting words into my mouth, as i said nothing about
> 'exposing it widely'. I suggest exposing it under the privileges of whoever
> has access to the guest image.
>

That doesn't work as nicely with SELinux.

It's completely reasonable to have a user that can interact in a read
only mode with a VM via libvirt but cannot read the guest's disk images
or the guest's memory contents.

> Secondly, regarding confidentiality, and this is guest security 101: whoever
> can access the image on the host _already_ has access to all the guest data!
>
> A Linux image can generally be loopback mounted straight away:
>
> losetup -o 32256 /dev/loop0 ./guest-image.img
> mount -o ro /dev/loop0 /mnt-guest
>
> (Or, if you are an unprivileged user who cannot mount, it can be read via ext2
> tools.)
>
> There's nothing the guest can do about that. The host is in total control of
> guest image data for heaven's sake!
>

It's not that simple in a MAC environment.

Regards,

Anthony Liguori

Ingo Molnar

Mar 16, 2010, 2:30:03 PM

* Anthony Liguori <alig...@linux.vnet.ibm.com> wrote:

> On 03/16/2010 12:52 PM, Ingo Molnar wrote:
> >* Anthony Liguori<alig...@linux.vnet.ibm.com> wrote:
> >
> >>On 03/16/2010 10:52 AM, Ingo Molnar wrote:
> >>>You are quite mistaken: KVM isnt really a 'random unprivileged application' in
> >>>this context, it is clearly an extension of system/kernel services.
> >>>
> >>>( Which can be seen from the simple fact that what started the discussion was
> >>> 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the
> >>> existing host-space /proc/kallsyms was desired. )
> >>Random tools (like perf) should not be able to do what you describe. It's a
> >>security nightmare.
> >A security nightmare exactly how? Mind to go into details as i dont understand
> >your point.
>
> Assume you're using SELinux to implement mandatory access control.
> How do you label this file system?
>
> Generally speaking, we don't know the difference between /proc/kallsyms vs.
> /dev/mem if we do generic passthrough. While it might be safe to have a
> relaxed label of kallsyms (since it's read only), it's clearly not safe to
> do that for /dev/mem, /etc/shadow, or any file containing sensitive
> information.

What's your _point_? Please outline a threat model, a vector of attack,
_anything_ that substantiates your "it's a security nightmare" claim.

> Rather, we ought to expose a higher level interface that we have more
> confidence in with respect to understanding the ramifications of exposing
> that guest data.

Exactly, we want something that has a flexible namespace and works well with
Linux tools in general. Preferably that namespace should be human readable,
and it should be hierarchic, and it should have a well-known permission model.

This concept exists in Linux and is generally called a 'filesystem'.

> >> No way. The guest has sensitive data and exposing it widely on the host
> >> is a bad thing to do. [...]
> >
> > Firstly, you are putting words into my mouth, as i said nothing about
> > 'exposing it widely'. I suggest exposing it under the privileges of
> > whoever has access to the guest image.
>
> That doesn't work as nicely with SELinux.
>
> It's completely reasonable to have a user that can interact in a read only
> mode with a VM via libvirt but cannot read the guest's disk images or the
> guest's memory contents.

If a user cannot read the image file then the user has no access to its
contents via other namespaces either. That is, of course, a basic security
aspect.

( That is perfectly true with a non-SELinux Unix permission model as well, and
is true in the SELinux case as well. )

> > Secondly, regarding confidentiality, and this is guest security 101: whoever
> > can access the image on the host _already_ has access to all the guest data!
> >
> > A Linux image can generally be loopback mounted straight away:
> >
> > losetup -o 32256 /dev/loop0 ./guest-image.img
> > mount -o ro /dev/loop0 /mnt-guest
> >
> >(Or, if you are an unprivileged user who cannot mount, it can be read via ext2
> >tools.)
> >
> > There's nothing the guest can do about that. The host is in total control of
> > guest image data for heaven's sake!
>
> It's not that simple in a MAC environment.

Erm. Please explain to me, what exactly is 'not that simple' in a MAC
environment?

Also, i'd like to note that the 'restrictive SELinux setups' usecases are
pretty secondary.

To demonstrate that, i'd like every KVM developer on this list who reads this
mail and who has their home development system where they produce their
patches set up in a restrictive MAC environment, such that you cannot even read
the images you are using, to chime in with an "I'm doing that" reply.

If there's just a _single_ KVM developer amongst dozens and dozens of
developers on this list who develops in an environment like that i'd be
surprised. That result should pretty much tell you where the weight of
instrumentation focus should lie - and it isn't on restrictive MAC environments
...

Joerg Roedel

Mar 16, 2010, 6:40:02 PM
On Tue, Mar 16, 2010 at 12:25:00PM +0100, Ingo Molnar wrote:
> Hm, that sounds rather messy if we want to use it to basically expose kernel
> functionality in a guest/host unified way. Is the qemu process discoverable in
> some secure way? Can we trust it? Is there some proper tooling available to do
> it, or do we have to push it through 2-3 packages to get such a useful feature
> done?

Since we want to implement a PMU usable by the guest anyway, why don't we
just use the guest's perf to get all the information we want? If we get a
PMU NMI from the guest we just re-inject it into the guest, and perf in the
guest gives us all the information we want, including kernel and userspace
symbols, stack traces, and so on.

In the previous thread we discussed about a direct trace channel between
guest and host kernel (which can be used for ftrace events for example).
This channel could be used to transport this information to the host
kernel.

The only additional feature needed is a way for the host to start a perf
instance in the guest.

Opinions?


Joerg

Masami Hiramatsu

Mar 16, 2010, 7:10:02 PM
Joerg Roedel wrote:
> On Tue, Mar 16, 2010 at 12:25:00PM +0100, Ingo Molnar wrote:
>> Hm, that sounds rather messy if we want to use it to basically expose kernel
>> functionality in a guest/host unified way. Is the qemu process discoverable in
>> some secure way? Can we trust it? Is there some proper tooling available to do
>> it, or do we have to push it through 2-3 packages to get such a useful feature
>> done?
>
> Since we want to implement a pmu usable for the guest anyway why we
> don't just use a guests perf to get all information we want? If we get a
> pmu-nmi from the guest we just re-inject it to the guest and perf in the
> guest gives us all information we wand including kernel and userspace
> symbols, stack traces, and so on.

I guess this aims to get information from old environments running on
kvm for life extension :)

> In the previous thread we discussed about a direct trace channel between
> guest and host kernel (which can be used for ftrace events for example).
> This channel could be used to transport this information to the host
> kernel.

Interesting! I know the people who are trying to do that with systemtap.
See http://vesper.sourceforge.net/

>
> The only additional feature needed is a way for the host to start a perf
> instance in the guest.

# ssh localguest perf record --host-channel ... ? B-)

Thank you,

>
> Opinions?
>
>
> Joerg
>

--
Masami Hiramatsu
e-mail: mhir...@redhat.com

Anthony Liguori

unread,
Mar 16, 2010, 7:10:02 PM3/16/10
to
On 03/16/2010 01:28 PM, Ingo Molnar wrote:
> * Anthony Liguori<alig...@linux.vnet.ibm.com> wrote:
>
>
>> On 03/16/2010 12:52 PM, Ingo Molnar wrote:
>>
>>> * Anthony Liguori<alig...@linux.vnet.ibm.com> wrote:
>>>
>>>
>>>> On 03/16/2010 10:52 AM, Ingo Molnar wrote:
>>>>
>>>>> You are quite mistaken: KVM isn't really a 'random unprivileged application' in
>>>>> this context, it is clearly an extension of system/kernel services.
>>>>>
>>>>> ( Which can be seen from the simple fact that what started the discussion was
>>>>> 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the
>>>>> existing host-space /proc/kallsyms was desired. )
>>>>>
>>>> Random tools (like perf) should not be able to do what you describe. It's a
>>>> security nightmare.
>>>>
>>> A security nightmare exactly how? Mind to go into details, as I don't understand
>>> your point.
>>>
>> Assume you're using SELinux to implement mandatory access control.
>> How do you label this file system?
>>
>> Generally speaking, we don't know the difference between /proc/kallsyms and
>> /dev/mem if we do generic passthrough. While it might be safe to have a
>> relaxed label for kallsyms (since it's read-only), it's clearly not safe to
>> do that for /dev/mem, /etc/shadow, or any file containing sensitive
>> information.
>>
> What's your _point_? Please outline a threat model, a vector of attack,
> _anything_ that substantiates your "it's a security nightmare" claim.
>

You suggested "to have a (read only) mount of all guest filesystems".

As I described earlier, not all of the information within the guest
filesystem has the same level of sensitivity. Exposing a generic
interface like this makes it very difficult to delegate privileges.

Delegating privileges is important because in a higher-security environment
you may want to prevent a management tool from accessing the VM's disk
directly, while still allowing it to do basic operations (in particular, to
view performance statistics).

>> Rather, we ought to expose a higher level interface that we have more
>> confidence in with respect to understanding the ramifications of exposing
>> that guest data.
>>
> Exactly, we want something that has a flexible namespace and works well with
> Linux tools in general. Preferably that namespace should be human readable,
> and it should be hierarchic, and it should have a well-known permission model.
>
> This concept exists in Linux and is generally called a 'filesystem'.
>

If you want to use a synthetic filesystem as the management interface
for qemu, that's one thing. But you suggested exposing the guest
filesystem in its entirety, and that's what I disagreed with.

> If a user cannot read the image file then the user has no access to its
> contents via other namespaces either. That is, of course, a basic security
> aspect.
>
> ( That is perfectly true with a non-SELinux Unix permission model as well, and
> is true in the SELinux case as well. )
>

I don't think that's reasonable at all. The guest may encrypt its disk
image. It still ought to be possible to run perf against that guest, no?

> Erm. Please explain to me: what exactly is 'not that simple' in a MAC
> environment?
>
> Also, I'd like to note that the 'restrictive SELinux setups' use cases are
> pretty secondary.
>
> To demonstrate that, I'd like every KVM developer on this list who reads this
> mail, and whose home development system (the one they produce their patches
> on) is set up in a restrictive MAC environment where they cannot even read
> the images they are using, to chime in with an "I'm doing that" reply.
>

My home system doesn't run SELinux, but I work daily with systems that
do.

I want to be able to run tools like perf on these systems because
ultimately, I need to debug these systems on a daily basis.

But that's missing the point. We want to have an interface that works
for both cases so that we're not maintaining two separate interfaces.

We've rat holed a bit though. You want:

1) to run perf kvm list and be able to enumerate KVM guests

2) for this to Just Work with qemu guests launched from the command line

You could achieve (1) by tying perf to libvirt but that won't work for
(2). There are a few practical problems with (2).

qemu does not require the user to associate any uniquely identifying
information with a VM. We've also optimized the command-line use case
so that if all you want to do is run a disk image, you just execute
"qemu foo.img". To satisfy your use case, we would either have to force
a user to always specify unique information, which would be less
convenient for our users, or we would have to let the name be an optional
parameter.

As it turns out, we already support "qemu -name Fedora foo.img". What
we don't do today, but I've been suggesting we should, is automatically
create a QMP management socket in a well-known location based on the
-name parameter when it's specified. That would let a tool like perf
Just Work provided that a user specified -name.
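
A rough sketch of how that could look (purely illustrative - neither the
automatic socket nor the perf option below exists today):

        qemu -name Fedora foo.img
        # qemu would then automatically create, say, /var/run/qemu/Fedora.qmp
        perf kvm --guest-name=Fedora top
        # hypothetical option that resolves the guest via that QMP socket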

No one uses -name today, though, and I'm sure you don't either.

The only way to really address this is to change the interaction.
Instead of running perf externally to qemu, we should support a perf
command in the qemu monitor that can then tie directly to the perf
tooling. That gives us the best possible user experience.

We can't do that though unless perf is a library or is in some way more
programmatic.

Regards,

Anthony Liguori

Anthony Liguori

unread,
Mar 16, 2010, 7:20:01 PM3/16/10
to
On 03/16/2010 12:39 PM, Ingo Molnar wrote:
>> If we look at the use-case, it's going to be something like, a user is
>> creating virtual machines and wants to get performance information about
>> them.
>>
>> Having to run a separate tool like perf is not going to be what they would
>> expect to have to do. Instead, they would either use their existing GUI
>> tool (like virt-manager) or they would use their management interface
>> (either QMP or libvirt).
>>
>> The complexity of interaction is due to the fact that perf shouldn't be a
>> stand alone tool. It should be a library or something with a programmatic
>> interface that another tool can make use of.
>>
> But ... a GUI interface/integration is of course possible too, and it's being
> worked on.
>
> perf is mainly a kernel developer tool, and kernel developers generally don't
> use GUIs to do their stuff: which is the (sole) reason why the first ~850
> commits of tools/perf/ were done without a GUI. We go where our developers
> are.
>
> In any case it's not an excuse to have no proper command-line tooling. In
> fact, if you cannot get simpler, more atomic command-line tooling right, then
> you'll probably doubly suck at doing a GUI as well.
>

It's about who owns the user interface.

If qemu owns the user interface, then we can satisfy this in a very
simple way by adding a perf monitor command. If we have to support
third-party tools, then it significantly complicates things.

Regards,

Anthony Liguori

Frank Ch. Eigler

unread,
Mar 16, 2010, 8:50:01 PM3/16/10
to
Hi -

On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote:
> [...]


> The only way to really address this is to change the interaction.
> Instead of running perf externally to qemu, we should support a perf
> command in the qemu monitor that can then tie directly to the perf
> tooling. That gives us the best possible user experience.

To what extent could this be solved with less crossing of
isolation/abstraction layers, if the perfctr facilities were properly
virtualized? That way guests could run perf goo internally.
Optionally, virt tools on the host side could aggregate data from
cooperating self-monitoring guests.

- FChE

Zhang, Yanmin

unread,
Mar 16, 2010, 10:40:01 PM3/16/10
to
On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote:
> On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:
> >
> >> Excellent, support for guest kernel != host kernel is critical (I can't
> >> remember the last time I ran same kernels).
> >>
> >> How would we support multiple guests with different kernels?
> >>
> > With the patch, 'perf kvm report --sort pid' could show
> > summary statistics for all guest os instances. Then, use
> > parameter --pid of 'perf kvm record' to collect single problematic instance data.
> >
>
> That certainly works, though automatic association of guest data with
> guest symbols is friendlier.
Thanks. Originally, I planned to add a -G parameter to perf, something like
-G 8888:/XXX/XXX/guestkallsyms:/XXX/XXX/modules,8889:/XXX/XXX/guestkallsyms:/XXX/XXX/modules
where 8888 and 8889 are just qemu guest pids.

So we could define multiple guest os symbol files. But it seems ugly,
and 'perf kvm report --sort pid' and 'perf kvm top --pid' could provide
similar functionality, as sketched below.
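
For illustration, that two-step workflow (a sketch; the --guestkallsyms and
--guestmodules options are elided):

        perf kvm --host --guest ... record -f -a sleep 60
        perf kvm --host --guest ... report --sort pid     # spot the busy qemu pid
        perf kvm --host --guest ... record --pid <qemu-pid>   # drill in; Ctrl-C to stop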

>
> >>> diff -Nraup linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c
> >>> --- linux-2.6_tipmaster0315/arch/x86/kvm/vmx.c 2010-03-16 08:59:11.825295404 +0800
> >>> +++ linux-2.6_tipmaster0315_perfkvm/arch/x86/kvm/vmx.c 2010-03-16 09:01:09.976084492 +0800
> >>> @@ -26,6 +26,7 @@
> >>> #include<linux/sched.h>
> >>> #include<linux/moduleparam.h>
> >>> #include<linux/ftrace_event.h>
> >>> +#include<linux/perf_event.h>
> >>> #include "kvm_cache_regs.h"
> >>> #include "x86.h"
> >>>
> >>> @@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
> >>> vmcs_write32(TPR_THRESHOLD, irr);
> >>> }
> >>>
> >>> +DEFINE_PER_CPU(int, kvm_in_guest) = {0};
> >>> +
> >>> +static void kvm_set_in_guest(void)
> >>> +{
> >>> + percpu_write(kvm_in_guest, 1);
> >>> +}
> >>> +
> >>> +static int kvm_is_in_guest(void)
> >>> +{
> >>> + return percpu_read(kvm_in_guest);
> >>> +}
> >>>
> >>>
> >>
> >
> >> There is already PF_VCPU for this.
> >>
> > Right, but there is a window between kvm_guest_enter and really running
> > in the guest os, where a perf event might overflow. Anyway, the window is
> > very narrow; I will change it to use the PF_VCPU flag.
> >
>
> There is also a window between setting the flag and calling 'int $2'
> where an NMI might happen and be accounted incorrectly.
>
> Perhaps separate the 'int $2' into a direct call into perf and another
> call for the rest of NMI handling. I don't see how it would work on svm
> though - AFAICT the NMI is held whereas vmx swallows it.

> I guess NMIs
> will be disabled until the next IRET so it isn't racy, just tricky.
I'm not sure whether vmexit breaks the NMI context or not. The hardware NMI
context isn't reentrant till an IRET. YangSheng would like to double-check it.

>
> >>> +static struct perf_guest_info_callbacks kvm_guest_cbs = {
> >>> + .is_in_guest = kvm_is_in_guest,
> >>> + .is_user_mode = kvm_is_user_mode,
> >>> + .get_guest_ip = kvm_get_guest_ip,
> >>> + .reset_in_guest = kvm_reset_in_guest,
> >>> +};
> >>>
> >>>
> >> Should be in common code, not vmx specific.
> >>
> > Right. I discussed with Yangsheng. I will move above data structures and
> > callbacks to file arch/x86/kvm/x86.c, and add get_ip, a new callback to
> > kvm_x86_ops.
> >
>
> You will need access to the vcpu pointer (kvm_rip_read() needs it); you
> can put it in a percpu variable.
We do so now in a new patch.

> I guess if it's not null, you know
> you're in a guest, so no need for PF_VCPU.
Good suggestion.
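
Roughly, a sketch of that direction (the variable name and callback bodies
are illustrative, not the actual new patch):

        DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);

        static int kvm_is_in_guest(void)
        {
                return percpu_read(current_vcpu) != NULL;
        }

        static unsigned long kvm_get_guest_ip(void)
        {
                return kvm_rip_read(percpu_read(current_vcpu));
        }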

Thanks.

Avi Kivity

unread,
Mar 17, 2010, 12:00:01 AM3/17/10
to
On 03/17/2010 02:41 AM, Frank Ch. Eigler wrote:
> Hi -
>
> On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote:
>
>> [...]
>> The only way to really address this is to change the interaction.
>> Instead of running perf externally to qemu, we should support a perf
>> command in the qemu monitor that can then tie directly to the perf
>> tooling. That gives us the best possible user experience.
>>
> To what extent could this be solved with less crossing of
> isolation/abstraction layers, if the perfctr facilities were properly
> virtualized?
>

That's the more interesting (by far) usage model. In general guest
owners don't have access to the host, and host owners can't (and
shouldn't) change guests.

Monitoring guests from the host is useful for kvm developers, but less
so for users.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

Ingo Molnar

unread,
Mar 17, 2010, 3:30:01 AM3/17/10
to

* Joerg Roedel <jo...@8bytes.org> wrote:

> On Tue, Mar 16, 2010 at 12:25:00PM +0100, Ingo Molnar wrote:
> > Hm, that sounds rather messy if we want to use it to basically expose kernel
> > functionality in a guest/host unified way. Is the qemu process discoverable in
> > some secure way? Can we trust it? Is there some proper tooling available to do
> > it, or do we have to push it through 2-3 packages to get such a useful feature
> > done?
>
> Since we want to implement a PMU usable by the guest anyway, why don't we

> just use the guest's perf to get all the information we want? [...]

Look at the previous posting of this patch; this is something new and rather
unique. The main power in the 'perf kvm' kind of instrumentation is to profile
_both_ the host and the guest on the host, using the same tool (often using
the same kernel) and using similar workloads, and to do profile comparisons
using 'perf diff'.

Note that KVM's in-kernel design makes it easy to offer this kind of
host/guest shared implementation that Yanmin has created. Other virtualization
solutions with a poorer design (for example, where the hypervisor code base is
split away from the guest implementation) will have a much harder time
creating something similar.

That kind of integrated approach can result in very interesting finds straight
away, see:

http://lkml.indiana.edu/hypermail/linux/kernel/1003.0/00613.html

( the profile there demos the need for spinlock accelerators for example -
there's clearly asymmetrically large overhead in guest spinlock code. Guess
how much else we'll be able to find with a full 'perf kvm' implementation. )

One of the main goals of a virtualization implementation is to eliminate as
many performance differences from the host kernel as possible. From the first
day KVM was released the overriding question from users was always: 'how much
slower is it than native, and which workloads are hit worst, and why, and
could you pretty please speed up important workload XYZ'.

'perf kvm' helps exactly that kind of development workflow.

Note that with oprofile you can already do separate guest-space and host-space
profiling (with the timer-driven fallback in the guest). One idea with 'perf
kvm' is to change that paradigm of forced separation and forced duplication,
and to support the workflow that most developers employ: use the host space
for development and unify instrumentation in an intuitive framework. Yanmin's
'perf kvm' patch is a very good step towards that goal.

Anyway ... look at the patches, try them and see it for yourself. Back in the
days when I did KVM performance work I wished I had something like Yanmin's
'perf kvm' feature. I'd probably still be hacking KVM today ;-)

So, the code is there, it's useful, and it's up to you guys whether you run
with this opportunity - the perf developers are certainly eager to help out
with the details. There are already tons of per-subsystem perf helper tools:
perf sched, perf kmem, perf lock, perf bench, perf timechart.

'perf kvm' is really a natural and good next step IMO that underlines the main
design goodness KVM brought to the world of virtualization: proper guest/host
code base integration.

Thanks,

Ingo

Ingo Molnar

unread,
Mar 17, 2010, 4:20:01 AM3/17/10
to

* Frank Ch. Eigler <fc...@redhat.com> wrote:

> Hi -
>
> On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote:
> > [...]
> > The only way to really address this is to change the interaction.
> > Instead of running perf externally to qemu, we should support a perf
> > command in the qemu monitor that can then tie directly to the perf
> > tooling. That gives us the best possible user experience.
>
> To what extent could this be solved with less crossing of
> isolation/abstraction layers, if the perfctr facilities were properly

> virtualized? [...]

Note, 'perfctr' is a different out-of-tree Linux kernel project run by someone
else: it offers the /dev/perfctr special-purpose device that allows raw,
unabstracted, low-level access to the PMU.

I suspect the one you wanted to mention here is called 'perf' or 'perf
events'. (and used to be called 'performance counters' or 'perfcounters' until
it got renamed about a year ago)

Thanks,

Ingo

Ingo Molnar

unread,
Mar 17, 2010, 4:20:01 AM3/17/10
to

* Avi Kivity <a...@redhat.com> wrote:

> Monitoring guests from the host is useful for kvm developers, but less so
> for users.

Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf'
will work if a proper paravirt channel is opened to the host)

I think you might have misunderstood the purpose and role of the 'perf kvm'
patch here? 'perf kvm' is aimed at KVM developers: it is they who improve KVM
code, not guest kernel users.

Ingo

Ingo Molnar

unread,
Mar 17, 2010, 4:20:02 AM3/17/10
to

* Anthony Liguori <ant...@codemonkey.ws> wrote:

Of course illogical modularization complicates things 'significantly'.

I wish both you and Avi looked back 3-4 years and realized what made KVM so
successful back then and why the hearts and minds of virtualization developers
were captured by KVM almost overnight.

KVM's main strength back then was that it was a surprisingly functional piece
of code offered by a 10 KLOC patch - right on the very latest upstream kernel.
Code was shared with upstream, there was version parity, and it all was in the
same single repo which was (and is) a pleasure to develop on.

Unlike Xen, which was a 200+ KLOC patch on top of a forked 10 MLOC kernel a
few upstream versions back. Xen had constant version friction due to that fork
and due to that forced/false separation/modularization: Xen _itself_ was a
fork of Linux to begin with. (for example Xen still had my copyrights last I
checked, which it got from old Linux code I worked on)

That forced separation and version friction in Xen was a development and
productization nightmare, and developing on KVM was a truly refreshing
experience. (I'll go out on a limb to declare that you won't find a _single_
developer on this list who will tell us otherwise.)

Fast forward to 2010. The kernel side of KVM is maximum goodness - by far the
worst-quality remaining aspects of KVM are precisely in areas that you
mention: 'if we have to support third-party tools, then it significantly
complicates things'. You kept Qemu as an external 'third party' entity to KVM,
and KVM is clearly hurting from that - just see the recent KVM usability
thread for examples about suckage.

So a similar 'complication' is the crux of the matter behind KVM quality
problems: you've not followed through with the original KVM vision and you
have not applied that concept to Qemu!

And please realize that the user does not care that KVM's kernel bits are
top-notch if the rest of the package has sucky aspects: it's always the
weakest link of the chain that matters to the user.

Xen sucked because of such design shortsightedness on the kernel level, and
now KVM suffers from it on the user space level.

If you want to jump to the next level of technological quality you need to fix
this attitude and you need to go back to the design roots of KVM. Concentrate
on Qemu (as that is the weakest link now), make it a first class member of the
KVM repo and simplify your development model by having a single repo:

- move a clean (and minimal) version of the Qemu code base to tools/kvm/, in
the upstream kernel repo, and work on that from that point on.

- co-develop new features within the same patch. Release new versions of
kvm-qemu and the kvm bits at the same time (together with the upstream
kernel), at well defined points in time.

- encourage kernel-space and user-space KVM developers to work on both
user-space and kernel-space bits as a single unit. It's one project and a
single experience to the user.

- [ and probably libvirt should go there too ]

If KVM's hypervisor and guest kernel code can enjoy the benefits of a single
repository, why can't the rest of KVM enjoy the same developer goodness? Only
fixing that will bring the breakthrough in quality - not more manpower
really.

Yes, I've read a thousand excuses for why this is an absolutely impossible
and a bad thing to do, and none of them was really convincing to me - and you
also have become rather emotional about all the arguments, so it's hard to
argue about it on a technical basis.

We made a similar (admittedly very difficult ...) design jump from oprofile to
perf, and I can tell you from that experience that it's day and night, both in
terms of development and in terms of the end result!

( We recently also made another kernel/kernel unification that had a very
positive result: we unified the 32-bit and 64-bit x86 architectures. Even
within the same repo the unification of technology is generally a good
thing. The KVM/Qemu situation is different - it's more similar to the perf
design. )

Not having to fight artificial package boundaries and forced package
separation is a very refreshing experience for a developer - and very
rewarding and flexible to develop on. ABI compatibility is _easier_ to
maintain in such a model. It's quite similar to the jump from Xen hacking to
KVM hacking (I did both). It's a bit like the jump from CVS to Git. Trust me,
you _cannot_ know the difference if you haven't tried a similar jump with
Qemu.

Anyway, you made your position about this rather clear and you are clearly
uncompromising, so I just wanted to post this note to the list: you'll waste
years of your life on a visibly crappy development model that has been unable
to break through a magic usability barrier for the past 2-3 years - just like
the Xen mis-design has wasted so many people's time and effort in kernel
space.

Thanks,

Ingo

Avi Kivity

unread,
Mar 17, 2010, 4:30:02 AM3/17/10
to
On 03/17/2010 10:16 AM, Ingo Molnar wrote:
> * Avi Kivity<a...@redhat.com> wrote:
>
>
>> Monitoring guests from the host is useful for kvm developers, but less so
>> for users.
>>
> Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf'
> will work if a proper paravirt channel is opened to the host)
>
> I think you might have misunderstood the purpose and role of the 'perf kvm'
> patch here? 'perf kvm' is aimed at KVM developers: it is them who improve KVM
> code, not guest kernel users.
>

Of course I understood it. My point was that 'perf kvm' serves a tiny
minority of users. That doesn't mean it isn't useful, just that it
doesn't satisfy all needs by itself.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


Ingo Molnar

unread,
Mar 17, 2010, 5:00:02 AM3/17/10
to

* Anthony Liguori <alig...@linux.vnet.ibm.com> wrote:

> If you want to use a synthetic filesystem as the management interface for
> qemu, that's one thing. But you suggested exposing the guest filesystem in
> its entirely and that's what I disagreed with.

What did you think, that it would be world-readable? Why would we do such a
stupid thing? Any mounted content should at minimum match whatever policy
covers the image file. The mounting of contents is not a privilege escalation,
and it is already possible today - just not integrated properly and not
practical. (and apparently not implemented for all the wrong 'security'
reasons)

> The guest may encrypt its disk image. It still ought to be possible to run
> perf against that guest, no?

_In_ the guest you can of course run it just fine. (once paravirt bits are in
place)

That has no connection to 'perf kvm' though, which this patch submission is
about ...

If you want unified profiling of both host and guest then you need access to
both the guest and the host. This is what the 'perf kvm' patch is about.
Please read the patch; I think you might be misunderstanding what it does ...

Regarding encrypted contents - that's really a distraction but the host has
absolute, 100% control over the guest and there's nothing the guest can do
about that - unless you are thinking about the sub-sub-case of Orwellian
DRM-locked-down systems - in which case there's nothing for the host to mount
and the guest can reject any requests for information on itself and impose
additional policy that way. So it's a security non-issue.

Note that DRM is pretty much the worst place to look when it comes to
usability: DRM lock-down is the antithesis of usability. Do you really want
KVM to match the mind-set of the RIAA and MPAA? Why do you pretend that a
developer cannot mount his own disk image? Pretty please, help Linux instead,
where development is driven by usability and accessibility ...

Thanks,

Ingo

Ingo Molnar

unread,
Mar 17, 2010, 5:00:02 AM3/17/10
to

* Avi Kivity <a...@redhat.com> wrote:

> On 03/17/2010 10:16 AM, Ingo Molnar wrote:
> >* Avi Kivity<a...@redhat.com> wrote:
> >
> >> Monitoring guests from the host is useful for kvm developers, but less so
> >> for users.
> >
> > Guest space profiling is easy, and 'perf kvm' is not about that. (plain
> > 'perf' will work if a proper paravirt channel is opened to the host)
> >
> > I think you might have misunderstood the purpose and role of the 'perf
> > kvm' patch here? 'perf kvm' is aimed at KVM developers: it is them who
> > improve KVM code, not guest kernel users.
>
> Of course I understood it. My point was that 'perf kvm' serves a tiny

> minority of users. [...]

I hope you won't be disappointed to learn that 100% of Linux, all 13+ million
lines of it, was and is being developed by a tiny, tiny, tiny minority of
users ;-)

> [...] That doesn't mean it isn't useful, just that it doesn't satisfy all
> needs by itself.

Of course - and it doesn't bring world peace either. One step at a time.

Thanks,

Ingo

Sheng Yang

unread,
Mar 17, 2010, 5:30:02 AM3/17/10
to
On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote:
> On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote:
> > On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:
> > > Right, but there is a window between kvm_guest_enter and really running
> > > in the guest os, where a perf event might overflow. Anyway, the window is
> > > very narrow; I will change it to use the PF_VCPU flag.
> >
> > There is also a window between setting the flag and calling 'int $2'
> > where an NMI might happen and be accounted incorrectly.
> >
> > Perhaps separate the 'int $2' into a direct call into perf and another
> > call for the rest of NMI handling. I don't see how it would work on svm
> > though - AFAICT the NMI is held whereas vmx swallows it.
> >
> > I guess NMIs
> > will be disabled until the next IRET so it isn't racy, just tricky.
>
> I'm not sure whether vmexit breaks the NMI context or not. The hardware NMI
> context isn't reentrant till an IRET. YangSheng would like to double-check it.

After more checking, I think VMX won't retain the NMI-blocked state for the
host. That means, if an NMI happens while the processor is in VMX non-root
mode, it only results in a VMExit, with a reason indicating that it was due to
an NMI, but no other state change in the host.

So in that sense, there _is_ a window between the VMExit and KVM handling the
NMI. Moreover, I think we _can't_ stop re-entrance of the NMI handling code,
because "int $2" has no effect on blocking the following NMI.

And if the NMI ordering is not important (I think it isn't), then we need to
generate a real NMI in the current post-vmexit code. It seems that letting the
APIC send an NMI IPI to itself is a good idea.

I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace
"int $2". Something unexpected is happening...

--
regards
Yang, Sheng

Zhang, Yanmin

unread,
Mar 17, 2010, 5:30:03 AM3/17/10
to
On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin...@linux.intel.com> wrote:
>
> > On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote:
> > > On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote:
> > > > On 03/16/2010 07:27 AM, Zhang, Yanmin wrote:
> > > > > From: Zhang, Yanmin<yanmin...@linux.intel.com>
> > > > >
> > > > > Based on the discussion in KVM community, I worked out the patch to support
> > > > > perf to collect guest os statistics from host side. This patch is implemented
> > > > > with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
> > > > > critical bug and provided good suggestions with other guys. I really appreciate
> > > > > their kind help.
> > > > >
> > > > > The patch adds new subcommand kvm to perf.
> > > > >
> > > > > perf kvm top
> > > > > perf kvm record
> > > > > perf kvm report
> > > > > perf kvm diff
> > > > >
> > > > > The new perf could profile guest os kernel except guest os user space, but it
> > > > > could summarize guest os user space utilization per guest os.
> > > > >
> > > > > Below are some examples.
> > > > > 1) perf kvm top
> > > > > [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
> > > > > --guestmodules=/home/ymzhang/guest/modules top
> > > > >
> > > > >
> > > >
> > > Thanks for your kind comments.

> > >
> > > > Excellent, support for guest kernel != host kernel is critical (I can't
> > > > remember the last time I ran same kernels).
> > > >
> > > > How would we support multiple guests with different kernels?
> > > With the patch, 'perf kvm report --sort pid' could show
> > > summary statistics for all guest os instances. Then, use
> > > parameter --pid of 'perf kvm record' to collect single problematic instance data.
> > Sorry. I found that --pid currently targets not a process but a single thread (the main thread).
> >
> > Ingo,
> >
> > Is it possible to support a new parameter or extend --inherit, so 'perf
> > record' and 'perf top' could collect data on all threads of a process when
> > the process is running?
> >
> > If not, I need to add a new ugly parameter, similar to --pid, to filter
> > out process data in userspace.
>
> Yeah. For maximum utility I'd suggest extending --pid to include this, and
> introducing --tid for the previous, limited-to-a-single-task functionality.
>
> Most users would expect --pid to work like a 'late attach' - i.e. to work like
> strace -f or like a gdb attach.

Thanks Ingo, Avi.

I worked out the patch below against tip/master of March 15th.

Subject: [PATCH] Change perf's parameter --pid to process-wide collection
From: Zhang, Yanmin <yanmin...@linux.intel.com>

Change parameter -p (--pid) to mean the real process pid and add -t (--tid)
meaning thread id. Now, --pid means perf collects the statistics of all
threads of the process, while --tid means perf just collects the statistics of
that thread.
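
For example (illustrative):

        perf record -p 1234     # samples all threads of process 1234; Ctrl-C to stop
        perf record -t 1234     # samples only thread 1234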

BTW, the patch fixes a bug in 'perf stat -p'. 'perf stat' always configures
attr->disabled=1 if it isn't a system-wide collection, so if there is a '-p'
and no forked command, 'perf stat -p' doesn't collect any data. In addition,
the while(!done) loop in run_perf_stat spins at 100% of one cpu, which has a
bad impact on the running workload. I added a sleep(1) to the loop.

Signed-off-by: Zhang Yanmin <yanmin...@linux.intel.com>

---

diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-record.c linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-record.c 2010-03-16 08:59:54.896488489 +0800
+++ linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c 2010-03-17 16:30:17.755551706 +0800
@@ -27,7 +27,7 @@
#include <unistd.h>
#include <sched.h>

-static int fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int *fd[MAX_NR_CPUS][MAX_COUNTERS];

static long default_interval = 0;

@@ -43,6 +43,9 @@ static int raw_samples = 0;
static int system_wide = 0;
static int profile_cpu = -1;
static pid_t target_pid = -1;
+static pid_t target_tid = -1;
+static int *all_tids = NULL;
+static int thread_num = 0;
static pid_t child_pid = -1;
static int inherit = 1;
static int force = 0;
@@ -60,7 +63,7 @@ static struct timeval this_read;

static u64 bytes_written = 0;

-static struct pollfd event_array[MAX_NR_CPUS * MAX_COUNTERS];
+static struct pollfd *event_array;

static int nr_poll = 0;
static int nr_cpu = 0;
@@ -77,7 +80,7 @@ struct mmap_data {
unsigned int prev;
};

-static struct mmap_data mmap_array[MAX_NR_CPUS][MAX_COUNTERS];
+static struct mmap_data *mmap_array[MAX_NR_CPUS][MAX_COUNTERS];

static unsigned long mmap_read_head(struct mmap_data *md)
{
@@ -225,11 +228,12 @@ static struct perf_header_attr *get_head
return h_attr;
}

-static void create_counter(int counter, int cpu, pid_t pid, bool forks)
+static void create_counter(int counter, int cpu, bool forks)
{
char *filter = filters[counter];
struct perf_event_attr *attr = attrs + counter;
struct perf_header_attr *h_attr;
+ int thread_index;
int track = !counter; /* only the first counter needs these */
int ret;
struct {
@@ -280,117 +284,124 @@ static void create_counter(int counter,
if (forks)
attr->enable_on_exec = 1;

+ for (thread_index = 0; thread_index < thread_num; thread_index++) {
try_again:
- fd[nr_cpu][counter] = sys_perf_event_open(attr, pid, cpu, group_fd, 0);
+ fd[nr_cpu][counter][thread_index] = sys_perf_event_open(attr,
+ all_tids[thread_index], cpu, group_fd, 0);

- if (fd[nr_cpu][counter] < 0) {
- int err = errno;
+ if (fd[nr_cpu][counter][thread_index] < 0) {
+ int err = errno;

- if (err == EPERM || err == EACCES)
- die("Permission error - are you root?\n"
- "\t Consider tweaking /proc/sys/kernel/perf_event_paranoid.\n");
- else if (err == ENODEV && profile_cpu != -1)
- die("No such device - did you specify an out-of-range profile CPU?\n");
+ if (err == EPERM || err == EACCES)
+ die("Permission error - are you root?\n"
+ "\t Consider tweaking /proc/sys/kernel/perf_event_paranoid.\n");
+ else if (err == ENODEV && profile_cpu != -1)
+ die("No such device - did you specify an out-of-range profile CPU?\n");

- /*
- * If it's cycles then fall back to hrtimer
- * based cpu-clock-tick sw counter, which
- * is always available even if no PMU support:
- */
- if (attr->type == PERF_TYPE_HARDWARE
- && attr->config == PERF_COUNT_HW_CPU_CYCLES) {
+ /*
+ * If it's cycles then fall back to hrtimer
+ * based cpu-clock-tick sw counter, which
+ * is always available even if no PMU support:
+ */
+ if (attr->type == PERF_TYPE_HARDWARE
+ && attr->config == PERF_COUNT_HW_CPU_CYCLES) {

- if (verbose)
- warning(" ... trying to fall back to cpu-clock-ticks\n");
- attr->type = PERF_TYPE_SOFTWARE;
- attr->config = PERF_COUNT_SW_CPU_CLOCK;
- goto try_again;
- }
- printf("\n");
- error("perfcounter syscall returned with %d (%s)\n",
- fd[nr_cpu][counter], strerror(err));
+ if (verbose)
+ warning(" ... trying to fall back to cpu-clock-ticks\n");
+ attr->type = PERF_TYPE_SOFTWARE;
+ attr->config = PERF_COUNT_SW_CPU_CLOCK;
+ goto try_again;
+ }
+ printf("\n");
+ error("perfcounter syscall returned with %d (%s)\n",
+ fd[nr_cpu][counter][thread_index],
+ strerror(err));

#if defined(__i386__) || defined(__x86_64__)
- if (attr->type == PERF_TYPE_HARDWARE && err == EOPNOTSUPP)
- die("No hardware sampling interrupt available. No APIC? If so then you can boot the kernel with the \"lapic\" boot parameter to force-enable it.\n");
+ if (attr->type == PERF_TYPE_HARDWARE && err == EOPNOTSUPP)
+ die("No hardware sampling interrupt available. No APIC? If so then you can boot the kernel with the \"lapic\" boot parameter to force-enable it.\n");
#endif

- die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
- exit(-1);
- }
+ die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
+ exit(-1);
+ }

- h_attr = get_header_attr(attr, counter);
- if (h_attr == NULL)
- die("nomem\n");
+ h_attr = get_header_attr(attr, counter);
+ if (h_attr == NULL)
+ die("nomem\n");
+
+ if (!file_new) {
+ if (memcmp(&h_attr->attr, attr, sizeof(*attr))) {
+ fprintf(stderr, "incompatible append\n");
+ exit(-1);
+ }
+ }

- if (!file_new) {
- if (memcmp(&h_attr->attr, attr, sizeof(*attr))) {
- fprintf(stderr, "incompatible append\n");
+ if (read(fd[nr_cpu][counter][thread_index], &read_data, sizeof(read_data)) == -1) {
+ perror("Unable to read perf file descriptor\n");
exit(-1);
}
- }
-
- if (read(fd[nr_cpu][counter], &read_data, sizeof(read_data)) == -1) {
- perror("Unable to read perf file descriptor\n");
- exit(-1);
- }

- if (perf_header_attr__add_id(h_attr, read_data.id) < 0) {
- pr_warning("Not enough memory to add id\n");
- exit(-1);
- }
+ if (perf_header_attr__add_id(h_attr, read_data.id) < 0) {
+ pr_warning("Not enough memory to add id\n");
+ exit(-1);
+ }

- assert(fd[nr_cpu][counter] >= 0);
- fcntl(fd[nr_cpu][counter], F_SETFL, O_NONBLOCK);
+ assert(fd[nr_cpu][counter][thread_index] >= 0);
+ fcntl(fd[nr_cpu][counter][thread_index], F_SETFL, O_NONBLOCK);

- /*
- * First counter acts as the group leader:
- */
- if (group && group_fd == -1)
- group_fd = fd[nr_cpu][counter];
- if (multiplex && multiplex_fd == -1)
- multiplex_fd = fd[nr_cpu][counter];
+ /*
+ * First counter acts as the group leader:
+ */
+ if (group && group_fd == -1)
+ group_fd = fd[nr_cpu][counter][thread_index];
+ if (multiplex && multiplex_fd == -1)
+ multiplex_fd = fd[nr_cpu][counter][thread_index];

- if (multiplex && fd[nr_cpu][counter] != multiplex_fd) {
+ if (multiplex && fd[nr_cpu][counter][thread_index] != multiplex_fd) {

- ret = ioctl(fd[nr_cpu][counter], PERF_EVENT_IOC_SET_OUTPUT, multiplex_fd);
- assert(ret != -1);
- } else {
- event_array[nr_poll].fd = fd[nr_cpu][counter];
- event_array[nr_poll].events = POLLIN;
- nr_poll++;
-
- mmap_array[nr_cpu][counter].counter = counter;
- mmap_array[nr_cpu][counter].prev = 0;
- mmap_array[nr_cpu][counter].mask = mmap_pages*page_size - 1;
- mmap_array[nr_cpu][counter].base = mmap(NULL, (mmap_pages+1)*page_size,
- PROT_READ|PROT_WRITE, MAP_SHARED, fd[nr_cpu][counter], 0);
- if (mmap_array[nr_cpu][counter].base == MAP_FAILED) {
- error("failed to mmap with %d (%s)\n", errno, strerror(errno));
- exit(-1);
+ ret = ioctl(fd[nr_cpu][counter][thread_index], PERF_EVENT_IOC_SET_OUTPUT, multiplex_fd);
+ assert(ret != -1);
+ } else {
+ event_array[nr_poll].fd = fd[nr_cpu][counter][thread_index];
+ event_array[nr_poll].events = POLLIN;
+ nr_poll++;
+
+ mmap_array[nr_cpu][counter][thread_index].counter = counter;
+ mmap_array[nr_cpu][counter][thread_index].prev = 0;
+ mmap_array[nr_cpu][counter][thread_index].mask = mmap_pages*page_size - 1;
+ mmap_array[nr_cpu][counter][thread_index].base = mmap(NULL,
+ (mmap_pages+1)*page_size,
+ PROT_READ|PROT_WRITE, MAP_SHARED,
+ fd[nr_cpu][counter][thread_index],
+ 0);
+ if (mmap_array[nr_cpu][counter][thread_index].base == MAP_FAILED) {
+ error("failed to mmap with %d (%s)\n", errno, strerror(errno));
+ exit(-1);
+ }
}
- }

- if (filter != NULL) {
- ret = ioctl(fd[nr_cpu][counter],
- PERF_EVENT_IOC_SET_FILTER, filter);
- if (ret) {
- error("failed to set filter with %d (%s)\n", errno,
- strerror(errno));
- exit(-1);
+ if (filter != NULL) {
+ ret = ioctl(fd[nr_cpu][counter][thread_index],
+ PERF_EVENT_IOC_SET_FILTER, filter);
+ if (ret) {
+ error("failed to set filter with %d (%s)\n", errno,
+ strerror(errno));
+ exit(-1);
+ }
}
- }

- ioctl(fd[nr_cpu][counter], PERF_EVENT_IOC_ENABLE);
+ ioctl(fd[nr_cpu][counter][thread_index], PERF_EVENT_IOC_ENABLE);
+ }
}

-static void open_counters(int cpu, pid_t pid, bool forks)
+static void open_counters(int cpu, bool forks)
{
int counter;

group_fd = -1;
for (counter = 0; counter < nr_counters; counter++)
- create_counter(counter, cpu, pid, forks);
+ create_counter(counter, cpu, forks);

nr_cpu++;
}
@@ -425,7 +436,7 @@ static int __cmd_record(int argc, const
int err;
unsigned long waking = 0;
int child_ready_pipe[2], go_pipe[2];
- const bool forks = target_pid == -1 && argc > 0;
+ const bool forks = target_tid == -1 && argc > 0;
char buf;

page_size = sysconf(_SC_PAGE_SIZE);
@@ -534,7 +545,7 @@ static int __cmd_record(int argc, const
child_pid = pid;

if (!system_wide)
- target_pid = pid;
+ target_tid = pid;

close(child_ready_pipe[1]);
close(go_pipe[0]);
@@ -550,11 +561,11 @@ static int __cmd_record(int argc, const


if ((!system_wide && !inherit) || profile_cpu != -1) {
- open_counters(profile_cpu, target_pid, forks);
+ open_counters(profile_cpu, forks);
} else {
nr_cpus = read_cpu_map();
for (i = 0; i < nr_cpus; i++)
- open_counters(cpumap[i], target_pid, forks);
+ open_counters(cpumap[i], forks);
}

if (file_new) {
@@ -579,7 +590,7 @@ static int __cmd_record(int argc, const
}

if (!system_wide && profile_cpu == -1)
- event__synthesize_thread(target_pid, process_synthesized_event,
+ event__synthesize_thread(target_tid, process_synthesized_event,
session);
else
event__synthesize_threads(process_synthesized_event, session);
@@ -602,11 +613,15 @@ static int __cmd_record(int argc, const

for (;;) {
int hits = samples;
+ int thread;

for (i = 0; i < nr_cpu; i++) {
for (counter = 0; counter < nr_counters; counter++) {
- if (mmap_array[i][counter].base)
- mmap_read(&mmap_array[i][counter]);
+ for (thread = 0;
+ thread < thread_num; thread++) {
+ if (mmap_array[i][counter][thread].base)
+ mmap_read(&mmap_array[i][counter][thread]);
+ }
}
}

@@ -619,8 +634,13 @@ static int __cmd_record(int argc, const

if (done) {
for (i = 0; i < nr_cpu; i++) {
- for (counter = 0; counter < nr_counters; counter++)
- ioctl(fd[i][counter], PERF_EVENT_IOC_DISABLE);
+ for (counter = 0; counter < nr_counters;
+ counter++) {
+ for (thread = 0;
+ thread < thread_num; thread++)
+ ioctl(fd[i][counter][thread],
+ PERF_EVENT_IOC_DISABLE);
+ }
}
}
}
@@ -653,6 +673,8 @@ static const struct option options[] = {
"event filter", parse_filter),
OPT_INTEGER('p', "pid", &target_pid,
"record events on existing pid"),
+ OPT_INTEGER('t', "tid", &target_tid,
+ "record events on existing thread id"),
OPT_INTEGER('r', "realtime", &realtime_prio,
"collect data with this RT SCHED_FIFO priority"),
OPT_BOOLEAN('R', "raw-samples", &raw_samples,
@@ -693,10 +715,11 @@ static const struct option options[] = {
int cmd_record(int argc, const char **argv, const char *prefix __used)
{
int counter;
+ int i,j;

argc = parse_options(argc, argv, options, record_usage,
PARSE_OPT_STOP_AT_NON_OPTION);
- if (!argc && target_pid == -1 && !system_wide && profile_cpu == -1)
+ if (!argc && target_pid == -1 && target_tid == -1 && !system_wide && profile_cpu == -1)
usage_with_options(record_usage, options);

symbol__init();
@@ -707,6 +730,37 @@ int cmd_record(int argc, const char **ar
attrs[0].config = PERF_COUNT_HW_CPU_CYCLES;
}

+ if (target_pid != -1) {
+ target_tid = target_pid;
+ thread_num = find_all_tid(target_pid, &all_tids);
+ if (thread_num <= 0) {
+ fprintf(stderr, "Can't find all threads of pid %d\n",
+ target_pid);
+ usage_with_options(record_usage, options);
+ }
+ } else {
+ all_tids=malloc(sizeof(int));
+ if (!all_tids)
+ return -ENOMEM;
+
+ all_tids[0] = target_tid;
+ thread_num = 1;
+ }
+
+ for (i = 0; i < MAX_NR_CPUS; i++) {
+ for (j = 0; j < MAX_COUNTERS; j++) {
+ fd[i][j] = malloc(sizeof(int)*thread_num);
+ mmap_array[i][j] = malloc(
+ sizeof(struct mmap_data)*thread_num);
+ if (!fd[i][j] || !mmap_array[i][j])
+ return -ENOMEM;
+ }
+ }
+ event_array = malloc(
+ sizeof(struct pollfd)*MAX_NR_CPUS*MAX_COUNTERS*thread_num);
+ if (!event_array)
+ return -ENOMEM;
+
/*
* User specified count overrides default frequency.
*/
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-stat.c linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-stat.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-stat.c 2010-03-16 08:59:54.892460680 +0800
+++ linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-stat.c 2010-03-17 16:30:25.484062179 +0800
@@ -46,6 +46,7 @@
#include "util/debug.h"
#include "util/header.h"
#include "util/cpumap.h"
+#include "util/thread.h"

#include <sys/prctl.h>
#include <math.h>
@@ -74,10 +75,14 @@ static int run_count = 1;
static int inherit = 1;
static int scale = 1;
static pid_t target_pid = -1;
+static pid_t target_tid = -1;
+static int *all_tids = NULL;
+static int thread_num = 0;
static pid_t child_pid = -1;
static int null_run = 0;
+static bool forks = false;

-static int fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int *fd[MAX_NR_CPUS][MAX_COUNTERS];

static int event_scaled[MAX_COUNTERS];

@@ -140,9 +145,10 @@ struct stats runtime_branches_stats;
#define ERR_PERF_OPEN \
"Error: counter %d, sys_perf_event_open() syscall returned with %d (%s)\n"

-static void create_perf_stat_counter(int counter, int pid)
+static void create_perf_stat_counter(int counter)
{
struct perf_event_attr *attr = attrs + counter;
+ int thread;

if (scale)
attr->read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
@@ -152,20 +158,24 @@ static void create_perf_stat_counter(int
unsigned int cpu;

for (cpu = 0; cpu < nr_cpus; cpu++) {
- fd[cpu][counter] = sys_perf_event_open(attr, -1, cpumap[cpu], -1, 0);
- if (fd[cpu][counter] < 0 && verbose)
+ fd[cpu][counter][0] = sys_perf_event_open(attr, -1, cpumap[cpu], -1, 0);
+ if (fd[cpu][counter][0] < 0 && verbose)
fprintf(stderr, ERR_PERF_OPEN, counter,
- fd[cpu][counter], strerror(errno));
+ fd[cpu][counter][0], strerror(errno));
}
} else {
attr->inherit = inherit;
- attr->disabled = 1;
+ if (forks)
+ attr->disabled = 1;
attr->enable_on_exec = 1;

- fd[0][counter] = sys_perf_event_open(attr, pid, -1, -1, 0);
- if (fd[0][counter] < 0 && verbose)
- fprintf(stderr, ERR_PERF_OPEN, counter,
- fd[0][counter], strerror(errno));
+ for (thread = 0; thread < thread_num; thread++) {
+ fd[0][counter][thread] = sys_perf_event_open(attr,
+ all_tids[thread], -1, -1, 0);
+ if (fd[0][counter][thread] < 0 && verbose)
+ fprintf(stderr, ERR_PERF_OPEN, counter,
+ fd[0][counter][thread], strerror(errno));
+ }
}
}

@@ -190,25 +200,29 @@ static void read_counter(int counter)
unsigned int cpu;
size_t res, nv;
int scaled;
- int i;
+ int i, thread;

count[0] = count[1] = count[2] = 0;

nv = scale ? 3 : 1;
for (cpu = 0; cpu < nr_cpus; cpu++) {
- if (fd[cpu][counter] < 0)
- continue;
-
- res = read(fd[cpu][counter], single_count, nv * sizeof(u64));
- assert(res == nv * sizeof(u64));
-
- close(fd[cpu][counter]);
- fd[cpu][counter] = -1;
-
- count[0] += single_count[0];
- if (scale) {
- count[1] += single_count[1];
- count[2] += single_count[2];
+
+ for (thread = 0; thread < thread_num; thread++) {
+ if (fd[cpu][counter][thread] < 0)
+ continue;
+
+ res = read(fd[cpu][counter][thread],
+ single_count, nv * sizeof(u64));
+ assert(res == nv * sizeof(u64));
+
+ close(fd[cpu][counter][thread]);
+ fd[cpu][counter][thread] = -1;
+
+ count[0] += single_count[0];
+ if (scale) {
+ count[1] += single_count[1];
+ count[2] += single_count[2];
+ }
}
}

@@ -251,11 +265,11 @@ static int run_perf_stat(int argc __used
unsigned long long t0, t1;
int status = 0;
int counter;
- int pid = target_pid;
+ int pid = target_tid;
int child_ready_pipe[2], go_pipe[2];
- const bool forks = (target_pid == -1 && argc > 0);
char buf;

+ forks = (target_tid == -1 && argc > 0);
if (!system_wide)
nr_cpus = 1;

@@ -307,10 +321,12 @@ static int run_perf_stat(int argc __used
if (read(child_ready_pipe[0], &buf, 1) == -1)
perror("unable to read pipe");
close(child_ready_pipe[0]);
+
+ all_tids[0] = pid;
}

for (counter = 0; counter < nr_counters; counter++)
- create_perf_stat_counter(counter, pid);
+ create_perf_stat_counter(counter);

/*
* Enable counters and exec the command:
@@ -321,7 +337,7 @@ static int run_perf_stat(int argc __used
close(go_pipe[1]);
wait(&status);
} else {
- while(!done);
+ while(!done) sleep(1);
}

t1 = rdclock();
@@ -429,12 +445,12 @@ static void print_stat(int argc, const c

fprintf(stderr, "\n");
fprintf(stderr, " Performance counter stats for ");
- if(target_pid == -1) {
+ if(target_tid == -1) {
fprintf(stderr, "\'%s", argv[0]);
for (i = 1; i < argc; i++)
fprintf(stderr, " %s", argv[i]);
}else
- fprintf(stderr, "task pid \'%d", target_pid);
+ fprintf(stderr, "task pid \'%d", target_tid);

fprintf(stderr, "\'");
if (run_count > 1)
@@ -459,7 +475,7 @@ static volatile int signr = -1;

static void skip_signal(int signo)
{
- if(target_pid != -1)
+ if(target_tid != -1)
done = 1;

signr = signo;
@@ -488,8 +504,10 @@ static const struct option options[] = {
parse_events),
OPT_BOOLEAN('i', "inherit", &inherit,
"child tasks inherit counters"),
- OPT_INTEGER('p', "pid", &target_pid,
+ OPT_INTEGER('p', "pid", &target_tid,
"stat events on existing pid"),
+ OPT_INTEGER('t', "tid", &target_tid,
+ "stat events on existing tid"),
OPT_BOOLEAN('a', "all-cpus", &system_wide,
"system-wide collection from all CPUs"),
OPT_BOOLEAN('c', "scale", &scale,
@@ -506,10 +524,11 @@ static const struct option options[] = {
int cmd_stat(int argc, const char **argv, const char *prefix __used)
{
int status;
+ int i,j;

argc = parse_options(argc, argv, options, stat_usage,
PARSE_OPT_STOP_AT_NON_OPTION);
- if (!argc && target_pid == -1)
+ if (!argc && target_pid == -1 && target_tid == -1)
usage_with_options(stat_usage, options);
if (run_count <= 0)
usage_with_options(stat_usage, options);
@@ -525,6 +544,32 @@ int cmd_stat(int argc, const char **argv
else
nr_cpus = 1;

+ if (target_pid != -1) {
+ target_tid = target_pid;
+ thread_num = find_all_tid(target_pid, &all_tids);
+ if (thread_num <= 0) {
+ fprintf(stderr, "Can't find all threads of pid %d\n",
+ target_pid);
+ usage_with_options(stat_usage, options);
+ }
+
+ } else {
+ all_tids=malloc(sizeof(int));
+ if (!all_tids)
+ return -ENOMEM;
+
+ all_tids[0] = target_tid;
+ thread_num = 1;
+ }
+
+ for (i = 0; i < MAX_NR_CPUS; i++) {
+ for (j = 0; j < MAX_COUNTERS; j++) {
+ fd[i][j] = malloc(sizeof(int)*thread_num);
+ if (!fd[i][j])
+ return -ENOMEM;
+ }
+ }
+
/*
* We dont want to block the signals - that would cause
* child tasks to inherit that and Ctrl-C would not work.
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-top.c linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-top.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-top.c 2010-03-16 08:59:54.760470652 +0800
+++ linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-top.c 2010-03-17 16:30:35.316716557 +0800
@@ -55,7 +55,8 @@
#include <linux/unistd.h>
#include <linux/types.h>

-static int fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int *fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int *all_tids = NULL;

static int system_wide = 0;

@@ -64,7 +65,9 @@ static int default_interval = 0;
static int count_filter = 5;
static int print_entries;

+static int target_tid = -1;
static int target_pid = -1;
+static int thread_num = 0;
static int inherit = 0;
static int profile_cpu = -1;
static int nr_cpus = 0;
@@ -524,13 +527,15 @@ static void print_sym_table(void)

if (target_pid != -1)
printf(" (target_pid: %d", target_pid);
+ else if (target_tid != -1)
+ printf(" (target_tid: %d", target_tid);
else
printf(" (all");

if (profile_cpu != -1)
printf(", cpu: %d)\n", profile_cpu);
else {
- if (target_pid != -1)
+ if (target_tid != -1)
printf(")\n");
else
printf(", %d CPUs)\n", nr_cpus);
@@ -1124,16 +1129,21 @@ static void perf_session__mmap_read_coun
md->prev = old;
}

-static struct pollfd event_array[MAX_NR_CPUS * MAX_COUNTERS];
-static struct mmap_data mmap_array[MAX_NR_CPUS][MAX_COUNTERS];
+static struct pollfd *event_array;
+static struct mmap_data *mmap_array[MAX_NR_CPUS][MAX_COUNTERS];

static void perf_session__mmap_read(struct perf_session *self)
{
- int i, counter;
+ int i, counter, thread_index;

for (i = 0; i < nr_cpus; i++) {
for (counter = 0; counter < nr_counters; counter++)
- perf_session__mmap_read_counter(self, &mmap_array[i][counter]);
+ for (thread_index = 0;
+ thread_index < thread_num;
+ thread_index++) {
+ perf_session__mmap_read_counter(self,
+ &mmap_array[i][counter][thread_index]);
+ }
}
}

@@ -1144,9 +1154,10 @@ static void start_counter(int i, int cou
{
struct perf_event_attr *attr;
int cpu;
+ int thread_index;

cpu = profile_cpu;
- if (target_pid == -1 && profile_cpu == -1)
+ if (target_tid == -1 && profile_cpu == -1)
cpu = cpumap[i];

attr = attrs + counter;
@@ -1162,55 +1173,58 @@ static void start_counter(int i, int cou
attr->inherit = (cpu < 0) && inherit;
attr->mmap = 1;

+ for (thread_index = 0; thread_index < thread_num; thread_index++) {
try_again:
- fd[i][counter] = sys_perf_event_open(attr, target_pid, cpu, group_fd, 0);
+ fd[i][counter][thread_index] = sys_perf_event_open(attr,
+ all_tids[thread_index], cpu, group_fd, 0);
+
+ if (fd[i][counter][thread_index] < 0) {
+ int err = errno;

- if (fd[i][counter] < 0) {
- int err = errno;
+ if (err == EPERM || err == EACCES)
+ die("No permission - are you root?\n");
+ /*
+ * If it's cycles then fall back to hrtimer
+ * based cpu-clock-tick sw counter, which
+ * is always available even if no PMU support:
+ */
+ if (attr->type == PERF_TYPE_HARDWARE
+ && attr->config == PERF_COUNT_HW_CPU_CYCLES) {
+
+ if (verbose)
+ warning(" ... trying to fall back to cpu-clock-ticks\n");
+
+ attr->type = PERF_TYPE_SOFTWARE;
+ attr->config = PERF_COUNT_SW_CPU_CLOCK;
+ goto try_again;
+ }
+ printf("\n");
+ error("perfcounter syscall returned with %d (%s)\n",
+ fd[i][counter][thread_index], strerror(err));
+ die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
+ exit(-1);
+ }
+ assert(fd[i][counter][thread_index] >= 0);
+ fcntl(fd[i][counter][thread_index], F_SETFL, O_NONBLOCK);

- if (err == EPERM || err == EACCES)
- die("No permission - are you root?\n");
/*
- * If it's cycles then fall back to hrtimer
- * based cpu-clock-tick sw counter, which
- * is always available even if no PMU support:
+ * First counter acts as the group leader:
*/
- if (attr->type == PERF_TYPE_HARDWARE
- && attr->config == PERF_COUNT_HW_CPU_CYCLES) {
+ if (group && group_fd == -1)
+ group_fd = fd[i][counter][thread_index];

- if (verbose)
- warning(" ... trying to fall back to cpu-clock-ticks\n");
-
- attr->type = PERF_TYPE_SOFTWARE;
- attr->config = PERF_COUNT_SW_CPU_CLOCK;
- goto try_again;
- }
- printf("\n");
- error("perfcounter syscall returned with %d (%s)\n",
- fd[i][counter], strerror(err));
- die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
- exit(-1);
+ event_array[nr_poll].fd = fd[i][counter][thread_index];
+ event_array[nr_poll].events = POLLIN;
+ nr_poll++;
+
+ mmap_array[i][counter][thread_index].counter = counter;
+ mmap_array[i][counter][thread_index].prev = 0;
+ mmap_array[i][counter][thread_index].mask = mmap_pages*page_size - 1;
+ mmap_array[i][counter][thread_index].base = mmap(NULL, (mmap_pages+1)*page_size,
+ PROT_READ, MAP_SHARED, fd[i][counter][thread_index], 0);
+ if (mmap_array[i][counter][thread_index].base == MAP_FAILED)
+ die("failed to mmap with %d (%s)\n", errno, strerror(errno));
}
- assert(fd[i][counter] >= 0);
- fcntl(fd[i][counter], F_SETFL, O_NONBLOCK);
-
- /*
- * First counter acts as the group leader:
- */
- if (group && group_fd == -1)
- group_fd = fd[i][counter];
-
- event_array[nr_poll].fd = fd[i][counter];
- event_array[nr_poll].events = POLLIN;
- nr_poll++;
-
- mmap_array[i][counter].counter = counter;
- mmap_array[i][counter].prev = 0;
- mmap_array[i][counter].mask = mmap_pages*page_size - 1;
- mmap_array[i][counter].base = mmap(NULL, (mmap_pages+1)*page_size,
- PROT_READ, MAP_SHARED, fd[i][counter], 0);
- if (mmap_array[i][counter].base == MAP_FAILED)
- die("failed to mmap with %d (%s)\n", errno, strerror(errno));
}

static int __cmd_top(void)
@@ -1226,8 +1240,8 @@ static int __cmd_top(void)
if (session == NULL)
return -ENOMEM;

- if (target_pid != -1)
- event__synthesize_thread(target_pid, event__process, session);
+ if (target_tid != -1)
+ event__synthesize_thread(target_tid, event__process, session);
else
event__synthesize_threads(event__process, session);

@@ -1238,7 +1252,7 @@ static int __cmd_top(void)
}

/* Wait for a minimal set of events before starting the snapshot */
- poll(event_array, nr_poll, 100);
+ poll(&event_array[0], nr_poll, 100);

perf_session__mmap_read(session);

@@ -1282,6 +1296,8 @@ static const struct option options[] = {
"event period to sample"),
OPT_INTEGER('p', "pid", &target_pid,
"profile events on existing pid"),
+ OPT_INTEGER('t', "tid", &target_tid,
+ "profile events on existing tid"),
OPT_BOOLEAN('a', "all-cpus", &system_wide,
"system-wide collection from all CPUs"),
OPT_INTEGER('C', "CPU", &profile_cpu,
@@ -1322,6 +1338,7 @@ static const struct option options[] = {
int cmd_top(int argc, const char **argv, const char *prefix __used)
{
int counter;
+ int i,j;

page_size = sysconf(_SC_PAGE_SIZE);

@@ -1329,8 +1346,39 @@ int cmd_top(int argc, const char **argv,
if (argc)
usage_with_options(top_usage, options);

+ if (target_pid != -1) {
+ target_tid = target_pid;
+ thread_num = find_all_tid(target_pid, &all_tids);
+ if (thread_num <= 0) {
+ fprintf(stderr, "Can't find all threads of pid %d\n",
+ target_pid);
+ usage_with_options(top_usage, options);
+ }
+ } else {
+ all_tids=malloc(sizeof(int));
+ if (!all_tids)
+ return -ENOMEM;
+
+ all_tids[0] = target_tid;
+ thread_num = 1;
+ }
+
+ for (i = 0; i < MAX_NR_CPUS; i++) {
+ for (j = 0; j < MAX_COUNTERS; j++) {
+ fd[i][j] = malloc(sizeof(int)*thread_num);
+ mmap_array[i][j] = malloc(
+ sizeof(struct mmap_data)*thread_num);
+ if (!fd[i][j] || !mmap_array[i][j])
+ return -ENOMEM;
+ }
+ }
+ event_array = malloc(
+ sizeof(struct pollfd)*MAX_NR_CPUS*MAX_COUNTERS*thread_num);
+ if (!event_array)
+ return -ENOMEM;
+
/* CPU and PID are mutually exclusive */
- if (target_pid != -1 && profile_cpu != -1) {
+ if (target_tid > 0 && profile_cpu != -1) {
printf("WARNING: PID switch overriding CPU\n");
sleep(1);
profile_cpu = -1;
@@ -1371,7 +1419,7 @@ int cmd_top(int argc, const char **argv,
attrs[counter].sample_period = default_interval;
}

- if (target_pid != -1 || profile_cpu != -1)
+ if (target_tid != -1 || profile_cpu != -1)
nr_cpus = 1;
else
nr_cpus = read_cpu_map();
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/thread.c linux-2.6_tipmaster0315_perfpid/tools/perf/util/thread.c
--- linux-2.6_tipmaster0315/tools/perf/util/thread.c 2010-03-16 08:59:54.892460680 +0800
+++ linux-2.6_tipmaster0315_perfpid/tools/perf/util/thread.c 2010-03-17 14:07:25.725218425 +0800
@@ -7,6 +7,37 @@
#include "util.h"
#include "debug.h"

+int find_all_tid(int pid, int ** all_tid)
+{
+ char name[256];
+ int items;
+ struct dirent **namelist = NULL;
+ int ret = 0;
+ int i;
+
+ sprintf(name, "/proc/%d/task", pid);
+ items = scandir(name, &namelist, NULL, NULL);
+ if (items <= 0)
+ return -ENOENT;
+ *all_tid = malloc(sizeof(int) * items);
+ if (!*all_tid) {
+ ret = -ENOMEM;
+ goto failure;
+ }
+
+ for (i = 0; i < items; i++)
+ (*all_tid)[i] = atoi(namelist[i]->d_name);
+
+ ret = items;
+
+failure:
+ for (i=0; i<items; i++)
+ free(namelist[i]);
+ free(namelist);
+
+ return ret;
+}
+
void map_groups__init(struct map_groups *self)
{
int i;
@@ -348,3 +379,4 @@ struct symbol *map_groups__find_symbol(s

return NULL;
}
+
diff -Nraup linux-2.6_tipmaster0315/tools/perf/util/thread.h linux-2.6_tipmaster0315_perfpid/tools/perf/util/thread.h
--- linux-2.6_tipmaster0315/tools/perf/util/thread.h 2010-03-16 08:59:54.764469663 +0800
+++ linux-2.6_tipmaster0315_perfpid/tools/perf/util/thread.h 2010-03-17 14:03:09.628322688 +0800
@@ -23,6 +23,7 @@ struct thread {
int comm_len;
};

+int find_all_tid(int pid, int ** all_tid);
void map_groups__init(struct map_groups *self);
int thread__set_comm(struct thread *self, const char *comm);
int thread__comm_len(struct thread *self);
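For reference, here is a minimal sketch of how the new find_all_tid() helper
is meant to be called; it mirrors what the cmd_top() hunk above already does,
with error handling abbreviated (prototype from util/thread.h assumed):

    #include <stdio.h>
    #include <stdlib.h>

    static void print_threads(int pid)
    {
            int *tids = NULL;
            int i;
            int n = find_all_tid(pid, &tids);  /* scans /proc/<pid>/task */

            if (n <= 0)
                    return;         /* -ENOENT or -ENOMEM on failure */
            for (i = 0; i < n; i++)
                    printf("tid %d\n", tids[i]);
            free(tids);             /* the caller owns the array */
    }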

Avi Kivity

Mar 17, 2010, 5:50:02 AM
On 03/17/2010 11:28 AM, Sheng Yang wrote:
>
>> I'm not sure whether vmexit breaks NMI context or not. A hardware NMI
>> context isn't reentrant until an IRET. YangSheng would like to double-check
>> it.
>>
> After more checking, I think VMX won't retain the NMI-blocked state for the
> host. That means, if an NMI happens while the processor is in VMX non-root
> mode, it only results in a VMExit, with an exit reason indicating it was due
> to an NMI, but no further state change in the host.
>
> So in that sense, there _is_ a window between the VMExit and KVM handling the
> NMI. Moreover, I think we _can't_ stop re-entrance of the NMI handling code,
> because "int $2" does nothing to block a following NMI.
>

That's pretty bad, as NMI runs on a separate stack (via IST). So if
another NMI happens while our int $2 is running, the stack will be
corrupted.

> And if the NMI ordering is not important (I think it isn't), then we need to
> generate a real NMI in the current post-VMExit code. Having the APIC send an
> NMI IPI to itself seems like a good idea.
>
> I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace
> "int $2". Something unexpected is happening...
>

I think you need DM_NMI for that to work correctly.

An alternative is to call the NMI handler directly.

--
error compiling committee.c: too many arguments to function
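To make the trade-off concrete, here is a minimal sketch of the two
reflection approaches being debated, assuming a VMX exit-handler context
(exit_reason_is_nmi() is a hypothetical placeholder, not actual KVM code):

    if (exit_reason_is_nmi(vcpu)) {         /* hypothetical helper */
            /*
             * Current approach: a software interrupt 2 runs the NMI
             * handler but does NOT set the CPU's NMI-blocked state,
             * so a second NMI can re-enter on the same IST stack.
             */
            asm volatile("int $2");

            /*
             * Proposed alternative: a real self-directed NMI IPI.
             * Passing NMI_VECTOR makes the APIC code select
             * delivery mode NMI (APIC_DM_NMI).
             */
            /* apic->send_IPI_self(NMI_VECTOR); */
    }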

Sheng Yang

Mar 17, 2010, 6:00:02 AM
On Wednesday 17 March 2010 17:41:58 Avi Kivity wrote:
> On 03/17/2010 11:28 AM, Sheng Yang wrote:
> >> I'm not sure whether vmexit breaks NMI context or not. A hardware NMI
> >> context isn't reentrant until an IRET. YangSheng would like to
> >> double-check it.
> >
> > After more checking, I think VMX won't retain the NMI-blocked state for
> > the host. That means, if an NMI happens while the processor is in VMX
> > non-root mode, it only results in a VMExit, with an exit reason indicating
> > it was due to an NMI, but no further state change in the host.
> >
> > So in that sense, there _is_ a window between the VMExit and KVM handling
> > the NMI. Moreover, I think we _can't_ stop re-entrance of the NMI handling
> > code, because "int $2" does nothing to block a following NMI.
>
> That's pretty bad, as NMI runs on a separate stack (via IST). So if
> another NMI happens while our int $2 is running, the stack will be
> corrupted.

Though the hardware doesn't provide this kind of blocking, software would at
least warn about it... nmi_enter() would still be executed by "int $2", and
would trigger a BUG() if we are already in NMI context (OK, that is a little
better than a mysterious crash due to a corrupted stack).
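For reference, the warning in question comes from nmi_enter(), simplified
here from the <linux/hardirq.h> of this era:

    #define nmi_enter()                                             \
            do {                                                    \
                    ftrace_nmi_enter();                             \
                    BUG_ON(in_nmi());     /* fires on re-entry */   \
                    add_preempt_count(NMI_OFFSET + HARDIRQ_OFFSET); \
                    lockdep_off();                                  \
                    trace_hardirq_enter();                          \
            } while (0)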


>
> > And if the NMI ordering is not important (I think it isn't), then we need
> > to generate a real NMI in the current post-VMExit code. Having the APIC
> > send an NMI IPI to itself seems like a good idea.
> >
> > I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to
> > replace "int $2". Something unexpected is happening...
>
> I think you need DM_NMI for that to work correctly.
>
> An alternative is to call the NMI handler directly.

apic_send_IPI_self() already took care of APIC_DM_NMI.

And would the NMI handler block a following NMI?

--
regards
Yang, Sheng

Avi Kivity

Mar 17, 2010, 6:10:02 AM
On 03/17/2010 11:51 AM, Sheng Yang wrote:
>
>> I think you need DM_NMI for that to work correctly.
>>
>> An alternative is to call the NMI handler directly.
>>
> apic_send_IPI_self() already took care of APIC_DM_NMI.
>

So it does (though not for x2apic?). I don't see why it doesn't work.
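(The x2apic caveat: in x2APIC mode the SELF IPI register takes only a vector
number and implies fixed delivery mode, so DM_NMI cannot be expressed through
it. A sketch modeled on the arch/x86 x2apic code:)

    static void x2apic_send_IPI_self(int vector)
    {
            /* no delivery-mode field here: fixed mode is implied,
             * so an NMI cannot be requested via this register */
            apic_write(APIC_SELF_IPI, vector);
    }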

> And would the NMI handler block a following NMI?
>
>

It wouldn't - won't work without extensive changes.

--
error compiling committee.c: too many arguments to function


Zachary Amsden

Mar 17, 2010, 5:20:02 PM

You can't use the APIC to send vectors 0x00-0x1f, or at least, you aren't
supposed to be able to.

Zach

Sheng Yang

Mar 17, 2010, 9:30:01 PM

Um? Why?

Especially since the kernel is already using it to deliver NMIs.

--
regards
Yang, Sheng

Zhang, Yanmin

Mar 17, 2010, 10:50:02 PM
Ingo,

Sorry, the patch has bugs. I need to do a better job and will work out 2
separate patches against the 2 issues.

Yanmin

Zachary Amsden

Mar 18, 2010, 1:00:02 AM

That's the only defined case, and it is defined because the vector field
is ignored for DM_NMI. See Vol 3A (exact section numbers may vary depending
on your version):

8.5.1 / 8.6.1

'100 (NMI) Delivers an NMI interrupt to the target processor or
processors. The vector information is ignored'

8.5.2 Valid Interrupt Vectors

'Local and I/O APICs support 240 of these vectors (in the range of 16 to
255) as valid interrupts.'

8.8.4 Interrupt Acceptance for Fixed Interrupts

'...; vectors 0 through 15 are reserved by the APIC (see also: Section
8.5.2, "Valid Interrupt Vectors")'

So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but
vectors 0x00-0x0f are not valid to send via APIC or I/O APIC.

Zach

Sheng Yang

Mar 18, 2010, 1:30:02 AM

As you pointed out, NMI is not a "Fixed" interrupt. If we want to send an NMI,
it needs a specific delivery mode rather than a vector number.

And if you look at the code, when we specify NMI_VECTOR, the delivery mode is
set to NMI.
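(A sketch of that selection, modeled on __prepare_ICR() in
arch/x86/kernel/apic/ipi.c; simplified, not verbatim:)

    static unsigned int prepare_icr(unsigned int shortcut, int vector,
                                    unsigned int dest)
    {
            unsigned int icr = shortcut | dest;

            if (vector == NMI_VECTOR)
                    icr |= APIC_DM_NMI;     /* vector field ignored */
            else
                    icr |= APIC_DM_FIXED | vector;

            return icr;
    }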

So what's wrong here?

--
regards
Yang, Sheng

Huang, Zhiteng

Mar 18, 2010, 1:30:02 AM
Hi Avi, Ingo,

I've been following this long thread since the very first email.

I'm a performance engineer whose job is to tune workloads running on top of KVM (and Xen previously). As a performance engineer, I desperately want a tool that can monitor the host and guests at the same time. Think about >100 guests, a mix of Linux and Windows, running together on a single system: being able to know what's happening is critical for performance analysis. Actually, I am the person who asked Yanmin to add the feature for CPU utilization breakdown (into host_usr, host_krn, guest_usr, guest_krn) so that I can monitor dozens of running guests. I haven't made this patch work on my system yet, but I _do_ think this patch is a very good start.

And finally, monitoring guests from the host is useful for users too (administrators and performance people like me). I really appreciate you guys' work and would love to provide feedback from my point of view if needed.


Regards,

HUANG, Zhiteng

Intel SSG/SSD/SPA/PRC Scalability Lab

Sheng Yang

Mar 18, 2010, 1:50:02 AM

OK, I think I understand your points now. You meant that these vectors can't
be placed in the vector field directly, right? But NMI is an exception due to
DM_NMI. Is that your point? I think we agree on this.

Zhang, Yanmin

Mar 18, 2010, 3:50:02 AM

I worked out 3 new patches against the tip/master tree of Mar. 17th.

1) Patch perf_stat: fix the issue that perf doesn't enable counters when
target_pid != -1. Change the condition to fork/exec the subcommand: if there
is a subcommand parameter, perf always forks/execs it. Usage example:
#perf stat -a sleep 10
So this command collects statistics for precisely 10 seconds. The user
can still stop it with CTRL+C.

2) Patch perf_record: fix the issue that when perf forks/execs a subcommand,
it should enable all counters only after the new process execs (see the
sketch after this list). Change the condition to fork/exec the subcommand:
if there is a subcommand parameter, perf always forks/execs it. Usage example:
#perf record -f -a sleep 10
So this command collects statistics for precisely 10 seconds. The user
can still stop it with CTRL+C.

3) Patch perf_pid: change parameter --pid to mean process-wide collection,
and add --tid for thread-wide statistics. Usage examples:
#perf top -p 8888
#perf record -p 8888 -f sleep 10
#perf stat -p 8888 -f sleep 10
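A sketch of the mechanism patch 2) relies on: open the counter disabled and
let the kernel enable it when the forked child execs. Field names follow the
perf_event_attr ABI of this period, and sys_perf_event_open() is perf's
syscall wrapper; this is an illustration, not the patch itself:

    static int open_counter_for_child(pid_t child_pid)
    {
            struct perf_event_attr attr = { };

            attr.type           = PERF_TYPE_HARDWARE;
            attr.config         = PERF_COUNT_HW_CPU_CYCLES;
            attr.disabled       = 1;    /* do not count yet ...      */
            attr.enable_on_exec = 1;    /* ... auto-enable at exec() */

            return sys_perf_event_open(&attr, child_pid, -1, -1, 0);
    }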

Arnaldo,

Please apply the 3 attached patches.

Yanmin

perf_stat_2.6_tipmaster0317_v02.patch
perf_record_2.6_tipmaster0317_v02.patch
perf_pid_2.6_tip0317_v06.patch

Ingo Molnar

Mar 18, 2010, 4:10:02 AM

* Zhang, Yanmin <yanmin...@linux.intel.com> wrote:

> I worked out 3 new patches against tip/master tree of Mar. 17th.

Cool! Mind sending them as a series of patches instead of attachments? That
makes it easier to review them. Also, the Signed-off-by lines seem to be
missing, plus we need a per-patch changelog as well.

Thanks,

Ingo

Avi Kivity

Mar 18, 2010, 4:30:02 AM
On 03/17/2010 10:10 AM, Ingo Molnar wrote:
>
>> It's about who owns the user interface.
>>
>> If qemu owns the user interface, than we can satisfy this in a very simple
>> way by adding a perf monitor command. If we have to support third party
>> tools, then it significantly complicates things.
>>
> Of course illogical modularization complicates things 'significantly'.
>

Who should own the user interface then?

> Fast forward to 2010. The kernel side of KVM is maximum goodness - by far the
> worst-quality remaining aspects of KVM are precisely in areas that you
> mention: 'if we have to support third party tools, then it significantly
> complicates things'. You kept Qemu as an external 'third party' entity to KVM,
> and KVM is clearly hurting from that - just see the recent KVM usability
> thread for examples about suckage.
>

Any qemu usability problems are because developers (or their employers)
are not interested in fixing them, not because of the repository
location. Most kvm developer interest is in server-side deployment
(even for desktop guests), so there is limited effort in implementing a
virtualbox-style GUI.

> - move a clean (and minimal) version of the Qemu code base to tools/kvm/, in
> the upstream kernel repo, and work on that from that point on.
>

I'll ignore the repository location which should be immaterial to a
serious developer and concentrate on the 'clean and minimal' aspect,
since it has some merit. Qemu development does have a tension between
the needs of kvm and tcg. For kvm we need fine-grained threading to
improve performance and tons of RAS work. For tcg these are mostly
meaningless, and the tcg code has sufficient inertia to reduce the rate
at which we can develop.

Nevertheless, the majority of developers feel that we'll lose more by a
fork (the community) than we gain by it (reduced constraints).

> - co-develop new features within the same patch. Release new versions of
> kvm-qemu and the kvm bits at the same time (together with the upstream
> kernel), at well defined points in time.
>

The majority of patches to qemu don't require changes to kvm, and vice
versa. The interface between qemu and kvm is fairly narrow, and most of
the changes are related to save/restore and guest debugging, hardly
areas of great interest to the casual user.

> - encourage kernel-space and user-space KVM developers to work on both
> user-space and kernel-space bits as a single unit. It's one project and a
> single experience to the user.
>

When a feature is developed that requires both kernel and qemu changes,
the same developer makes the changes in both projects. Having them in
different repositories does not appear to be a problem.

> - [ and probably libvirt should go there too ]
>

Let's make a list of projects that don't need to be in the kernel
repository; it will probably be shorter.

Seriously, libvirt is a cross-platform cross-hypervisor library, it has
no business near the Linux kernel.

> If KVM's hypervisor and guest kernel code can enjoy the benefits of a single
> repository,

In fact I try hard not to rely too much on that. While both kvm guest
and host code are in the same repo, there is an ABI barrier between them
because we need to support any guest version on any host version. When
designing, writing, or reading guest or host code that interacts across
that barrier we need to keep forward and backward compatibility in
mind. It's very different from normal kernel APIs that we can adapt
whenever the need arises.

> why cannot the rest of KVM enjoy the same developer goodness? Only
> fixing that will bring the break-through in quality - not more manpower
> really.
>

I really don't understand why you believe that. You seem to want a
virtualbox-style GUI, and lkml is probably the last place in the world
to develop something like that. The developers here are mostly
uninterested in GUI and usability problems; remember, these are people
who think emacs xor vi is a great editor.

> Yes, i've read a thousand excuses for why this is an absolutely impossible and
> a bad thing to do, and none of them was really convincing to me - and you also
> have become rather emotional about all the arguments so it's hard to argue
> about it on a technical basis.
>
> We made a similar (admittedly very difficult ...) design jump from oprofile to
> perf, and i can tell you from that experience that it's day and night, both in
> terms of development and in terms of the end result!
>

Maybe it was due to better design and implementation choices.

> ( We recently also made another, kernel/kernel unification that had a very
> positive result: we unified the 32-bit and 64-bit x86 architectures. Even
> within the same repo the unification of technology is generally a good
> thing. The KVM/Qemu situation is different - it's more similar to the perf
> design. )
>
> Not having to fight artificial package boundaries and forced package
> separation is very refreshing experience to a developer - and very rewarding
> and flexible to develop on. ABI compatibility is _easier_ to maintain in such
> a model. It's quite similar to the jump from Xen hacking to KVM hacking (i did
> both). It's a bit like the jump from CVS to Git. Trust me, you _cannot_ know
> the difference if you havent tried a similar jump with Qemu.
>

Why is ABI compatibility easier to maintain in a single repo?

> Anyway, you made your position about this rather clear and you are clearly
> uncompromising, so i just wanted to post this note to the list: you'll waste
> years of your life on a visibly crappy development model that has been unable
> to break through a magic usability barrier for the past 2-3 years - just like
> the Xen mis-design has wasted so many people's time and effort in kernel
> space.
>

Do you really think the echo'n'cat school of usability wants to write a
GUI? In linux-2.6.git?

--
error compiling committee.c: too many arguments to function


Zachary Amsden

Mar 18, 2010, 4:50:02 AM

Yes, I think we agree. NMI is the only vector in 0x0-0xf which can be
sent via self-IPI because the vector itself does not matter for NMI.

Zach

Jes Sorensen

Mar 18, 2010, 4:50:01 AM
On 03/17/10 09:10, Ingo Molnar wrote:
> I wish both you and Avi looked back 3-4 years and realized what made KVM so
> successful back then and why the hearts and minds of virtualization developers
> were captured by KVM almost overnight.

Ingo,

What made KVM so successful was that the core kernel of the hypervisor
was designed the right way, as a kernel module where it belonged. It was
obvious to anyone who had been exposed to the main competition at the
time, Xen, that this was the right approach. What has ended up killing
Xen in the end is the not-invented-here approach of copying everything
over, reformatting it, and rewriting half of it, which made it
impossible to maintain and support as a single codebase. At my previous
employer we ended up dropping all Xen efforts exactly because it was
like maintaining two separate operating system kernels. The key to KVM
is that once you have Linux, you practically have KVM as well.

> Fast forward to 2010. The kernel side of KVM is maximum goodness - by far the
> worst-quality remaining aspects of KVM are precisely in areas that you
> mention: 'if we have to support third party tools, then it significantly
> complicates things'. You kept Qemu as an external 'third party' entity to KVM,
> and KVM is clearly hurting from that - just see the recent KVM usability
> thread for examples about suckage.
>
> So a similar 'complication' is the crux of the matter behind KVM quality
> problems: you've not followed through with the original KVM vision and you
> have not applied that concept to Qemu!

Well there are two ways to go about this. Either you base the KVM
userland on top of an existing project, like QEMU, _or_ you rewrite it
all from scratch. However, there is far more to it than just a couple of
ioctls, for example the stack of reverse device-drivers is a pretty
significant code base, rewriting that and maintaining it is not a
trivial task. It is certainly my belief that the benefit we get from
sharing that with QEMU by far outweighs the cost of forking it and
keeping our own fork in the kernel tree. In fact it would result in
exactly the same problems I mentioned above wrt Xen.

> If you want to jump to the next level of technological quality you need to fix
> this attitude and you need to go back to the design roots of KVM. Concentrate
> on Qemu (as that is the weakest link now), make it a first class member of the
> KVM repo and simplify your development model by having a single repo:
>
> - move a clean (and minimal) version of the Qemu code base to tools/kvm/, in
> the upstream kernel repo, and work on that from that point on.

With this you have just thrown away all the benefits of having the QEMU
repository shared with other developers who will actively fix bugs in
components we do care about for KVM.

> - encourage kernel-space and user-space KVM developers to work on both
> user-space and kernel-space bits as a single unit. It's one project and a
> single experience to the user.

This is already happening and a total non issue.

> - [ and probably libvirt should go there too ]

Now that would be interesting, next we'll have to include things like
libxml in the kernel git tree as well, to make sure libvirt doesn't get
out of sync with the version supplied by your distribution vendor.

> Yes, i've read a thousand excuses for why this is an absolutely impossible and
> a bad thing to do, and none of them was really convincing to me - and you also
> have become rather emotional about all the arguments so it's hard to argue
> about it on a technical basis.

So far your argument would justify pulling all of gdb into the kernel
git tree as well, to support the kgdb efforts, or gcc so we can get rid
of the gcc version quirks in the kernel header files, e2fsprogs and
equivalent for _all_ file systems included in the kernel so we can make
sure our fs tools never get out of sync with whats supported in the
kernel......

> We made a similar (admittedly very difficult ...) design jump from oprofile to
> perf, and i can tell you from that experience that it's day and night, both in
> terms of development and in terms of the end result!

The user components for perf vs oprofile are _tiny_ projects compared to
the portions of QEMU that are actually used by KVM.

Oh and you completely forgot SeaBIOS. KVM+QEMU rely on SeaBIOS too, so
from what you're saying we should pull that into the kernel git
repository as well. Never mind the fact that we share SeaBIOS with the
coreboot project which is very actively adding features to it that
benefit us as well.....

Sorry, but there are times when unification make sense, and there are
times where having a reasonably well designed split makes sense. KVM
had problems with QEMU in the past which resulted in the qemu-kvm branch
of it, which proved to be a major pain to deal with, but that is
fortunately improving and qemu-kvm should go away completely at some
point.

Cheers,
Jes

Ingo Molnar

Mar 18, 2010, 5:00:01 AM

* Avi Kivity <a...@redhat.com> wrote:

> On 03/17/2010 10:10 AM, Ingo Molnar wrote:
> >
> >> It's about who owns the user interface.
> >>
> >> If qemu owns the user interface, than we can satisfy this in a very
> >> simple way by adding a perf monitor command. If we have to support third
> >> party tools, then it significantly complicates things.
> >
> > Of course illogical modularization complicates things 'significantly'.
>
> Who should own the user interface then?

If qemu was in tools/kvm/ then we wouldnt have such issues. A single patch (or
series of patches) could modify tools/kvm/, arch/x86/kvm/, virt/ and
tools/perf/.

Numerous times did we have patches to kernel/perf_event.c that fixed some
detail, also accompanied by a tools/perf/ patch fixing another detail. Having
a single 'culture of contribution' is a powerful way to develop.

It turns out kernel developers can be pretty good user-space developers as
well and user-space developers can be pretty good kernel developers as well.
Some like to do both - as long as it's all within a single project.

The moment any change (be it as trivial as fixing a GUI detail or as complex
as a new feature) involves two or more packages, development speed slows down
to a crawl - while the complexity of the change might be very low!

Also, there's the harmful process that people start categorizing themselves
into 'I am a kernel developer' and 'I am a user space programmer' stereotypes,
which limits the scope of contributions artificially.

> > Fast forward to 2010. The kernel side of KVM is maximum goodness - by far
> > the worst-quality remaining aspects of KVM are precisely in areas that you
> > mention: 'if we have to support third party tools, then it significantly
> > complicates things'. You kept Qemu as an external 'third party' entity to
> > KVM, and KVM is clearly hurting from that - just see the recent KVM
> > usability thread for examples about suckage.
>
> Any qemu usability problems are because developers (or their employers) are
> not interested in fixing them, not because of the repository location. Most
> kvm developer interest is in server-side deployment (even for desktop
> guests), so there is limited effort in implementing a virtualbox-style GUI.

The same has been said of oprofile as well: 'it somewhat sucks because we are
too server centric', 'nobody is interested in good usability and oprofile is
fine for the enterprises'. Ironically, the same has been said of Xen usability
as well, up to the point KVM came around.

The core of the problem was a bad design and a split kernel-side/user-side
tool landscape.

In fact i think saying that 'our developers only care about the server' is
borderline dishonest, when at the same time you are making it doubly sure (by
inaction) that it stays so: by leaving an artificial package wall between
kernel-side KVM and user-side KVM and not integrating the two technologies.

You'll never know what heights you could achieve if you leave that wall there
...

Furthermore, what should be realized is that bad usability hurts "server
features" just as much. Most of the day-to-day testing is done on the desktop
by desktop oriented testers/developers. _Not_ by enterprise shops - they tend
to see the code years down the line to begin with ...

Yes, a particular feature might be server oriented, but a good portion of our
testing is on the desktop and everyone is hurting from bad usability and this
puts limits on contribution efficiency.

As the patch posted in _this very thread_ demonstrates, it is doubly
difficult to contribute a joint KVM+Qemu feature, because there are two separate
code bases, two contribution guidelines, two release schedules. While to the
user it really is just one and the same thing. It should be so for the
developer as well.

Put in another way: KVM's current split design is making it easy to contribute
server features (because the kernel side is clean and cool), but also makes it
artificially hard to contribute desktop features: because the tooling side
(Qemu) is 'just another package', is separated by a package and maintenance
wall and is made somewhat uncool by a (as some KVM developers have pointed out
in this thread) quirky codebase.

(the rest of your points are really a function of this fundamental
disagreement)

Thanks,

Ingo

Ingo Molnar

Mar 18, 2010, 5:30:03 AM

* Avi Kivity <a...@redhat.com> wrote:

> > - move a clean (and minimal) version of the Qemu code base to tools/kvm/,
> > in the upstream kernel repo, and work on that from that point on.
>
> I'll ignore the repository location which should be immaterial to a serious
> developer and concentrate on the 'clean and minimal' aspect, since it has

> some merit. [...]

To the contrary, experience shows that repository location, and in particular
a shared repository for closely related bits, is very much material!

It matters because when there are two separate projects, even a "serious
developer" finds it doubly and triply difficult to contribute even
trivial changes.

It becomes literally a nightmare if you have to touch 3 packages: kernel, a
library and an app codebase. It takes _forever_ to get anything useful done.

Also, 'focus on a single thing' is a very basic aspect of humans, especially
those who do computer programming. Working on two code bases in two
repositories at once can be very challenging physically and psychically.

So what i've seen is that OSS programmers tend to pick a side, pretty much
randomly, and then rationalize it in hindsight why they prefer that side ;-)

Most of them become either a kernel developer or a user-space package
developer - and then they specialize on that field and shy away from changes
that involve both. It's a basic human thing to avoid the hassle that comes
with multi-package changes. (One really has to be outright stupid, fanatic or
desperate to even attempt such changes these days - such are the difficulties
for a comparatively low return.)

The solution is to tear down such artificial walls of contribution where
possible. And tearing down the wall between KVM and qemu-kvm seems very much
possible and the advantages would be numerous.

Unless by "serious developer" you meant: "developer willing to [or forced to]
waste time and effort on illogically structured technology".

> [...]


>
> Do you really think the echo'n'cat school of usability wants to write a GUI?
> In linux-2.6.git?

Then you'll be surprised to hear that it's happening as we speak and the
commits are there in linux-2.6.git. Both a TUI and a GUI are in the works.

Furthermore, the numbers show that half of the usability fixes to tools/perf/
came not from regular perf contributors but from random kernel developers and
testers who build the latest kernel and try out perf at the same
time (it's very easy because you already have it in the kernel repository - no
separate download, no installation, etc. necessary).

I had literally zero such contributions when (the precursor to) 'perf' was
still a separate user-space project.

You could have the same effect for Qemu: the latest bits in tools/kvm/ would
be built by regular kernel testers and developers. The integration benefits
dont just extend to developers, a unified project is vastly easier to test as
well.

Thanks,

Ingo

Alexander Graf

Mar 18, 2010, 5:30:02 AM

On 18.03.2010, at 09:56, Ingo Molnar wrote:

>
> * Avi Kivity <a...@redhat.com> wrote:
>
>> On 03/17/2010 10:10 AM, Ingo Molnar wrote:
>>>
>>>> It's about who owns the user interface.
>>>>
>>>> If qemu owns the user interface, than we can satisfy this in a very
>>>> simple way by adding a perf monitor command. If we have to support third
>>>> party tools, then it significantly complicates things.
>>>
>>> Of course illogical modularization complicates things 'significantly'.
>>
>> Who should own the user interface then?
>
> If qemu was in tools/kvm/ then we wouldnt have such issues. A single patch (or
> series of patches) could modify tools/kvm/, arch/x86/kvm/, virt/ and
> tools/perf/.

It's not a 1:1 connection. There are more users of the KVM interface. To name a few I'm aware of:

- Mac-on-Linux (PPC)
- Dolphin (PPC)
- Xenner (x86)
- Kuli (s390)

Having a clear userspace interface is the only viable solution there. And if you're interested, look at my MOL enabling patch. It's less than 500 lines of code.

The kernel/userspace interface really isn't the difficult part. Getting device emulation working properly, easily and fast is.


Alex

Ingo Molnar

Mar 18, 2010, 6:00:02 AM

* Jes Sorensen <Jes.So...@redhat.com> wrote:

> On 03/17/10 09:10, Ingo Molnar wrote:
>
> > I wish both you and Avi looked back 3-4 years and realized what made KVM
> > so successful back then and why the hearts and minds of virtualization
> > developers were captured by KVM almost overnight.
>
> Ingo,
>
> What made KVM so successful was that the core kernel of the hypervisor was
> designed the right way, as a kernel module where it belonged. It was obvious
> to anyone who had been exposed to the main competition at the time, Xen,
> that this was the right approach. What ended up killing Xen is the
> not-invented-here approach of copying everything over, reformatting
> it, and rewriting half of it, which made it impossible to maintain and

> support as a single codebase. [...]

Yes, exactly.

I was part of that nightmare so i know.

> [...]


>
> At my previous employer we ended up dropping all Xen efforts exactly because
> it was like maintaining two separate operating system kernels. The key to
> KVM is that once you have Linux, you practically have KVM as well.

Yes. Please realize that what is behind it is a strikingly simple argument:

"Once you have a single project to develop and maintain all is much better."

> > Fast forward to 2010. The kernel side of KVM is maximum goodness - by far
> > the worst-quality remaining aspects of KVM are precisely in areas that you
> > mention: 'if we have to support third party tools, then it significantly
> > complicates things'. You kept Qemu as an external 'third party' entity to
> > KVM, and KVM is clearly hurting from that - just see the recent KVM
> > usability thread for examples about suckage.
> >
> > So a similar 'complication' is the crux of the matter behind KVM quality
> > problems: you've not followed through with the original KVM vision and you
> > have not applied that concept to Qemu!
>
> Well there are two ways to go about this. Either you base the KVM userland
> on top of an existing project, like QEMU, _or_ you rewrite it all from

> scratch. [...]

Btw., i made similar arguments to Avi about 3 years ago when it was going
upstream, that qemu should be unified with KVM. This is more true today than
ever.

> [...] However, there is far more to it than just a couple of ioctls, for

> example the stack of reverse device-drivers is a pretty significant code
> base, rewriting that and maintaining it is not a trivial task. It is
> certainly my belief that the benefit we get from sharing that with QEMU by
> far outweighs the cost of forking it and keeping our own fork in the kernel
> tree. In fact it would result in exactly the same problems I mentioned above
> wrt Xen.

I do not suggest forking Qemu at all, i suggest using the most natural
development model for the KVM+Qemu shared project: a single repository.

> > If you want to jump to the next level of technological quality you need to
> > fix this attitude and you need to go back to the design roots of KVM.
> > Concentrate on Qemu (as that is the weakest link now), make it a first
> > class member of the KVM repo and simplify your development model by having
> > a single repo:
> >
> > - move a clean (and minimal) version of the Qemu code base to tools/kvm/,
> > in the upstream kernel repo, and work on that from that point on.
>
> With this you have just thrown away all the benefits of having the QEMU
> repository shared with other developers who will actively fix bugs in
> components we do care about for KVM.

Not if it's a unified project.

> > - encourage kernel-space and user-space KVM developers to work on both
> > user-space and kernel-space bits as a single unit. It's one project and
> > a single experience to the user.
>
> This is already happening and a total non issue.

My experience as an external observer of the end result contradicts this.

Seemingly trivial usability changes to the KVM+Qemu combo are not being done
often because they involve cross-discipline changes.

( _In this very thread_ there has been a somewhat self-defeating argument by
Anthony that a multi-package scenario would 'significantly complicate'
matters. What more proof do we need to state the obvious? Keeping what
has become one piece of technology over the years in two separate halves is
obviously bad. )

> > - [ and probably libvirt should go there too ]
>
> Now that would be interesting, next we'll have to include things like libxml
> in the kernel git tree as well, to make sure libvirt doesn't get out of sync
> with the version supplied by your distribution vendor.

The way we have gone about this in tools/perf/ is similar to the route picked
by Git: we only use very lowlevel libraries available everywhere, and we
provide optional wrappers to the rest.

We are also using the kernel's libraries so we rarely need to go outside to
get some functionality.

I.e. it's a non-issue in practice and despite perf having an (optional)
dependency on xmlto and docbook we dont include those packages nor do we force
users to install particular versions of them.

> > Yes, i've read a thousand excuses for why this is an absolutely impossible
> > and a bad thing to do, and none of them was really convincing to me - and
> > you also have become rather emotional about all the arguments so it's hard
> > to argue about it on a technical basis.
>
> So far your argument would justify pulling all of gdb into the kernel git
> tree as well, to support the kgdb efforts, or gcc so we can get rid of the
> gcc version quirks in the kernel header files, e2fsprogs and equivalent for
> _all_ file systems included in the kernel so we can make sure our fs tools
> never get out of sync with what's supported in the kernel......

gdb and gcc are clearly extrinsic to the kernel, so why would we move them
there?

I was talking about tools that are closely related to the kernel - where much
of the development and actual use is in combination with the Linux kernel.

90%+ of the Qemu usecases are combined with Linux. (Yes, i know that you can
run Qemu without KVM, and no, i dont think it matters in the grand scheme of
things and most investment into Qemu comes from the KVM angle these days. In
particular it for sure does not justify handicapping future KVM evolution so
drastically.)

> > We made a similar (admittedly very difficult ...) design jump from
> > oprofile to perf, and i can tell you from that experience that it's day
> > and night, both in terms of development and in terms of the end result!
>
> The user components for perf vs oprofile are _tiny_ projects compared to the
> portions of QEMU that are actually used by KVM.

I know the size and scope of Qemu, i even hacked it - still my points remain.
(my arguments are influenced and strengthened by that past hacking experience)

> Oh and you completely forgot SeaBIOS. KVM+QEMU rely on SeaBIOS too, so from
> what you're saying we should pull that into the kernel git repository as
> well. Never mind the fact that we share SeaBIOS with the coreboot project
> which is very actively adding features to it that benefit us as well.....

SeaBIOS is in essence firmware, so it could simply be loaded as such.

Just look at the qemu source code - the BIOSes are .bin images in
qemu/pc-bios/ imported externally in essence.

Moving qemu to tools/kvm/ would not change that much. The firmware could
become part of /lib/firmware/*.bin.

( That would probably be a more intelligent approach to the BIOS image import
problem as well. )

> Sorry, but there are times when unification make sense, and there are times
> where having a reasonably well designed split makes sense. KVM had problems
> with QEMU in the past which resulted in the qemu-kvm branch of it, which
> proved to be a major pain to deal with, but that is fortunately improving
> and qemu-kvm should go away completely at some point.

qemu-kvm branch is not similar to my proposal at all: it made KVM _more_
fragmented, not more unified. I.e. it was a move in the exact opposite
direction and i'd expect such a move to fail.

In fact the failure of qemu-kvm supports my point rather explicitly: it
demonstrates that extra packages and split development are actively harmful.

I speak about this as a person who has done successful unifications of split
codebases and in my judgement this move would be significantly beneficial to
KVM.

You cannot really validly reject this proposal with "It wont work" as it
clearly has worked in other, comparable cases. You could only reject this with
"I have tried it and it didnt work".

Think about it: a clean and hackable user-space component in tools/kvm/. It's
very tempting :-)

Ingo

Ingo Molnar

Mar 18, 2010, 6:20:02 AM

* Alexander Graf <ag...@suse.de> wrote:

>
> On 18.03.2010, at 09:56, Ingo Molnar wrote:
>
> >
> > * Avi Kivity <a...@redhat.com> wrote:
> >
> >> On 03/17/2010 10:10 AM, Ingo Molnar wrote:
> >>>
> >>>> It's about who owns the user interface.
> >>>>
> >>>> If qemu owns the user interface, than we can satisfy this in a very
> >>>> simple way by adding a perf monitor command. If we have to support third
> >>>> party tools, then it significantly complicates things.
> >>>
> >>> Of course illogical modularization complicates things 'significantly'.
> >>
> >> Who should own the user interface then?
> >
> > If qemu was in tools/kvm/ then we wouldnt have such issues. A single patch (or
> > series of patches) could modify tools/kvm/, arch/x86/kvm/, virt/ and
> > tools/perf/.
>
> It's not a 1:1 connection. There are more users of the KVM interface. To
> name a few I'm aware of:
>
> - Mac-on-Linux (PPC)
> - Dolphin (PPC)
> - Xenner (x86)
> - Kuli (s390)

There must be a misunderstanding here: tools/perf/ still has a clear userspace
interface and ABI. There are external projects making use of it: sysprof and
libpfm (and probably more i dont know about). Those projects are also
contributing back.

Still it's _very_ useful to have a single reference implementation under
tools/perf/ where we concentrate the best of the code. That is where we make
sure that each new kernel feature is appropriately implemented in user-space
as well, that the combination works well together and is releasable to users.
That is what keeps us all honest: the latency of features is much lower, and
there's no ping-pong of blame going on between the two components in case of
bugs or in case of misfeatures.

Same goes for KVM+Qemu: it would be so much nicer to have a single,
well-focused reference implementation under tools/kvm/ and have improvements
flow into that code base.

That way KVM developers cannot just shrug "well, GUI suckage is a user-space
problem" - like the answers i got in the KVM usability thread ...

The buck will stop here.

And if someone thinks he can do better, an external project can be started
anytime. (it may even replace the upstream thing if it's better)

> Having a clear userspace interface is the only viable solution there. And if
> you're interested, look at my MOL enabling patch. It's less than 500 lines
> of code.

Why do you suppose that what i propose is an "either or" scenario?

It isnt. I just suggested that instead of letting core KVM fragment its limbs
into an external entity, put your name behind one good all-around solution and
focus the development model into a single project.

I.e. do what KVM originally did in kernel space - and where it was so much
better than Xen: single focus.

Learn from what KVM has done so well in the initial years and use the concept
on the user-space components as well. The very same arguments that caused KVM
to integrate into the upstream kernel (instead of being a separate project)
are a valid basis to integrate the user-space components into tools/kvm/. Dont
forget your roots and dont assume all your design decisions were correct.

> The kernel/userspace interface really isn't the difficult part. Getting
> device emulation working properly, easily and fast is.

The kernel/userspace ABI is not difficult at all. Getting device emulation
working properly, easily and fast indeed is. And my experience is that it is
not working properly nor quickly at the moment, at all. (see the 'KVM
usability' thread)

Getting device emulation working properly often involves putting certain
pieces that are currently done in Qemu into kernel-space. That kind of
'movement of emulation technology' from user-space component into the
kernel-space component [or back] would very clearly be helped if those two
components were in the same repository.

And i have first-hand experience there: we had (and have) similar scenarios
with tools/perf routinely. We did some aspects in user-space, then decided to
do it in kernel-space. Sometimes we moved kernel bits to user-space. It was
very easy and there were no package and version complications as it's a single
project. Sometimes we even moved bits back and forth until we found the right
balance.

Thanks,

Ingo

Avi Kivity

Mar 18, 2010, 6:20:03 AM
On 03/18/2010 10:56 AM, Ingo Molnar wrote:
> * Avi Kivity<a...@redhat.com> wrote:
>
>
>> On 03/17/2010 10:10 AM, Ingo Molnar wrote:
>>
>>>
>>>> It's about who owns the user interface.
>>>>
>>>> If qemu owns the user interface, than we can satisfy this in a very
>>>> simple way by adding a perf monitor command. If we have to support third
>>>> party tools, then it significantly complicates things.
>>>>
>>> Of course illogical modularization complicates things 'significantly'.
>>>
>> Who should own the user interface then?
>>
> If qemu was in tools/kvm/ then we wouldnt have such issues. A single patch (or
> series of patches) could modify tools/kvm/, arch/x86/kvm/, virt/ and
> tools/perf/.
>

We would have exactly the same issues, only they would be in a single
repository. The only difference is that we could ignore potential
alternatives to qemu, libvirt, and RHEV-M. But that's not how kernel
ABIs are developed, we try to make them general, not suited to just one
consumer that happens to be close to our heart.

> Numerous times did we have patches to kernel/perf_event.c that fixed some
> detail, also accompanied by a tools/perf/ patch fixing another detail. Having
> a single 'culture of contribution' is a powerful way to develop.
>

In fact kvm started out in a single repo, and it certainly made it easy
to bring it up in baby steps. But we've long outgrown that. Maybe the
difference is that perf is still new and thus needs tight cooperation.
If/when perf gains a real GUI, I doubt more than 1% of the patches will
touch both kernel and userspace.

> It turns out kernel developers can be pretty good user-space developers as
> well and user-space developers can be pretty good kernel developers as well.
> Some like to do both - as long as it's all within a single project.
>

Very childish of them. If someone wants to contribute to a userspace
project, they can swallow their pride and send patches to a non-kernel
mailing list and repository.

> The moment any change (be it as trivial as fixing a GUI detail or as complex
> as a new feature) involves two or more packages, development speed slows down
> to a crawl - while the complexity of the change might be very low!
>

Why is that?

If the maintainers of all packages are cooperative and responsive, then
the patches will get accepted quickly. If they aren't, development will
be slow. It isn't any different from contributing to two unrelated
kernel subsystems (which are in fact in different repositories until the
next merge window).

> Also, there's the harmful process that people start categorizing themselves
> into 'I am a kernel developer' and 'I am a user space programmer' stereotypes,
> which limits the scope of contributions artificially.
>

You're encouraging this with your proposal. You're basically using the
glory of kernel development to attract people to userspace.

>>> Fast forward to 2010. The kernel side of KVM is maximum goodness - by far
>>> the worst-quality remaining aspects of KVM are precisely in areas that you
>>> mention: 'if we have to support third party tools, then it significantly
>>> complicates things'. You kept Qemu as an external 'third party' entity to
>>> KVM, and KVM is clearly hurting from that - just see the recent KVM
>>> usability thread for examples about suckage.
>>>
>> Any qemu usability problems are because developers (or their employers) are
>> not interested in fixing them, not because of the repository location. Most
>> kvm developer interest is in server-side deployment (even for desktop
>> guests), so there is limited effort in implementing a virtualbox-style GUI.
>>
> The same has been said of oprofile as well: 'it somewhat sucks because we are
> too server centric', 'nobody is interested in good usability and oprofile is
> fine for the enterprises'. Ironically, the same has been said of Xen usability
> as well, up to the point KVM came around.
>
> The core of the problem was a bad design and a split kernel-side/user-side
> tool landscape.
>

I can accept the bad design (not knowing any of the details), but how
can the kernel/user split affect usability?

> In fact i think saying that 'our developers only care about the server' is
> borderline dishonest, when at the same time you are making it doubly sure (by
> inaction) that it stays so: by leaving an artificial package wall between
> kernel-side KVM and user-side KVM and not integrating the two technologies.
>

The wall is maybe four nanometers high. Please be serious. If someone
wants to work on qemu usability all they have to do is to clone the
repository and start sending patches to qemu-devel@. What's gained by
putting it in the kernel repository? You're saving a minute's worth of
clone, and that only for people who already happen to be kernel developers.

> You'll never know what heights you could achieve if you leave that wall there
> ...
>

I truly don't know. What highly usable GUIs were developed in the kernel?

> Furthermore, what should be realized is that bad usability hurts "server
> features" just as much. Most of the day-to-day testing is done on the desktop
> by desktop oriented testers/developers. _Not_ by enterprise shops - they tend
> to see the code years down the line to begin with ...
>
> Yes, a particular feature might be server oriented, but a good portion of our
> testing is on the desktop and everyone is hurting from bad usability and this
> puts limits on contribution efficiency.
>

I'm not saying that improved usability isn't a good thing, but time
spent on improving the GUI is time not spent on the features that we
really want.

Desktop oriented users also rarely test 16 vcpu guests with tons of RAM
exercising 10Gb NICs and a SAN. Instead they care about graphics
performance for 2vcpu/1GB guests.

> As the patch posted in _this very thread_ demonstrates, it is doubly
> difficult to contribute a joint KVM+Qemu feature, because there are two separate
> code bases, two contribution guidelines, two release schedules. While to the
> user it really is just one and the same thing. It should be so for the
> developer as well.
>

It's hard to contribute a patch that goes against the architecture of
the system, where kvm deals with cpu virtualization, qemu (or
theoretically another tool) manages a guest, and libvirt (or another
tool) manages the host. You want a list of guests to be provided by
qemu or the kernel, and that simply isn't how the system works.

> Put in another way: KVM's current split design is making it easy to contribute
> server features (because the kernel side is clean and cool), but also makes it
> artificially hard to contribute desktop features: because the tooling side
> (Qemu) is 'just another package', is separated by a package and maintenance
> wall


Most server oriented patches in qemu/kvm have gone into qemu, not kvm
(simply because it sees many more patches overall). It isn't hard to
contribute to 'just another package', I have 1700 packages installed on
my desktop and only one of them is a kernel.

Anyway your arguments apply equally well to gedit.

> and is made somewhat uncool by a (as some KVM developers have pointed out
> in this thread) quirky codebase.
>

The qemu codebase is in fact quirky, but cp won't solve it. Only long
patchsets to qemu-devel@.

> (the rest of your points are really a function of this fundamental
> disagreement)
>

I disagree.

--
error compiling committee.c: too many arguments to function


Ingo Molnar

Mar 18, 2010, 6:30:01 AM

* Avi Kivity <a...@redhat.com> wrote:

> On 03/18/2010 10:56 AM, Ingo Molnar wrote:
> >* Avi Kivity<a...@redhat.com> wrote:
> >
> >>On 03/17/2010 10:10 AM, Ingo Molnar wrote:
> >>>>It's about who owns the user interface.
> >>>>
> >>>>If qemu owns the user interface, than we can satisfy this in a very
> >>>>simple way by adding a perf monitor command. If we have to support third
> >>>>party tools, then it significantly complicates things.
> >>>Of course illogical modularization complicates things 'significantly'.
> >>Who should own the user interface then?
> >If qemu was in tools/kvm/ then we wouldnt have such issues. A single patch (or
> >series of patches) could modify tools/kvm/, arch/x86/kvm/, virt/ and
> >tools/perf/.
>
> We would have exactly the same issues, only they would be in a single
> repository. The only difference is that we could ignore potential
> alternatives to qemu, libvirt, and RHEV-M. But that's not how kernel ABIs
> are developed, we try to make them general, not suited to just one consumer
> that happens to be close to our heart.

Not at all - as i replied in a previous mail, tools/perf/ still has a clear
userspace interface and ABI, and external projects are making use of it.

So there's no problem with the ABI at all.

In fact our experience has been the opposite: the perf ABI is markedly better
_because_ there's an immediate consumer of it in the form of tools/perf/. It
gets tested better and external projects can get their ABI tweaks in as well
and can provide a reference implementation for tools/perf. This has happened a
couple of times. It's a win-win scenario.

So the exact opposite of what you suggest is happening in practice.

Thanks,

Ingo

Avi Kivity

Mar 18, 2010, 6:30:02 AM
On 03/18/2010 12:10 PM, Ingo Molnar wrote:
>
>> It's not a 1:1 connection. There are more users of the KVM interface. To
>> name a few I'm aware of:
>>
>> - Mac-on-Linux (PPC)
>> - Dolphin (PPC)
>> - Xenner (x86)
>> - Kuli (s390)
>>
> There must be a misunderstanding here: tools/perf/ still has a clear userspace
> interface and ABI. There are external projects making use of it: sysprof and
> libpfm (and probably more i dont know about). Those projects are also
> contributing back.
>

So it seems it is possible to scale the package wall.

> Still it's _very_ useful to have a single reference implementation under
> tools/perf/ where we concentrate the best of the code. That is where we make
> sure that each new kernel feature is appropriately implemented in user-space
> as well, that the combination works well together and is releasable to users.
> That is what keeps us all honest: the latency of features is much lower, and
> there's no ping-pong of blame going on between the two components in case of
> bugs or in case of misfeatures.
>

That would make sense for a truly minimal userspace for kvm: we once had
a tool called kvmctl which was used to run tests (since folded into
qemu). It didn't contain a GUI and was unable to run a general purpose
guest. It was a few hundred lines of code, and indeed patches to kvmctl
had a much closer correspondence to patches to kvm (though still low,
as most kvm patches don't modify the ABI).

> Same goes for KVM+Qemu: it would be so much nicer to have a single,
> well-focused reference implementation under tools/kvm/ and have improvements
> flow into that code base.
>
> That way KVM developers cannot just shrug "well, GUI suckage is a user-space
> problem" - like the answers i got in the KVM usability thread ...
>
> The buck will stop here.
>

Suppose we copy qemu tomorrow into tools/. All the problems will be
copied with it. Someone still has to write patches to fix them. Who
will it be?

>> The kernel/userspace interface really isn't the difficult part. Getting
>> device emulation working properly, easily and fast is.
>>
> The kernel/userspace ABI is not difficult at all. Getting device emulation
> working properly, easily and fast indeed is. And my experience is that it is
> not working properly nor quickly at the moment, at all. (see the 'KVM
> usability' thread)
>
> Getting device emulation working properly often involves putting certain
> pieces that are currently done in Qemu into kernel-space. That kind of
> 'movement of emulation technology' from user-space component into the
> kernel-space component [or back] would very clearly be helped if those two
> components were in the same repository.
>

Moving emulation into the kernel is indeed a problem. Not because it's
difficult, but because it indicates that the interfaces exposed to
userspace are insufficient to obtain good performance. We had that with
vhost-net and I'm afraid we'll have that with vhost-blk.

> And i have first-hand experience there: we had (and have) similar scenarios
> with tools/perf routinely. We did some aspects in user-space, then decided to
> do it in kernel-space. Sometimes we moved kernel bits to user-space. It was
> very easy and there were no package and version complications as it's a single
> project. Sometimes we even moved bits back and forth until we found the right
> balance.
>

That's reasonable in the first iterations of a project.

--
error compiling committee.c: too many arguments to function


Avi Kivity

Mar 18, 2010, 6:40:02 AM
On 03/18/2010 11:22 AM, Ingo Molnar wrote:
> * Avi Kivity<a...@redhat.com> wrote:
>
>
>>> - move a clean (and minimal) version of the Qemu code base to tools/kvm/,
>>> in the upstream kernel repo, and work on that from that point on.
>>>
>> I'll ignore the repository location which should be immaterial to a serious
>> developer and concentrate on the 'clean and minimal' aspect, since it has
>> some merit. [...]
>>
> To the contrary, experience shows that repository location, and in particular
> a shared repository for closely related bits is very much material!
>
> It matters because when there are two separate projects, even a "serious
> developer" is finding it double and triple difficult to contribute even
> trivial changes.
>
> It becomes literally a nightmare if you have to touch 3 packages: kernel, a
> library and an app codebase. It takes _forever_ to get anything useful done.
>

You can't be serious. I find that the difficulty in contributing a
patch has mostly to do with writing the patch, and less with figuring
out which email address to send it to.

> Also, 'focus on a single thing' is a very basic aspect of humans, especially
> those who do computer programming. Working on two code bases in two
> repositories at once can be very challenging physically and psychically.
>

Indeed, working simultaneously on two different projects is difficult.
I usually work for a while on one, and then 'cd', physically and
psychically, to the other. Then switch back. Sort of like the
scheduler on a uniprocessor machine.

> So what i've seen is that OSS programmers tend to pick a side, pretty much
> randomly, and then rationalize it in hindsight why they prefer that side ;-)
>
> Most of them become either a kernel developer or a user-space package
> developer - and then they specialize on that field and shy away from changes
> that involve both. It's a basic human thing to avoid the hassle that comes
> with multi-package changes. (One really has to be outright stupid, fanatic or
> desperate to even attempt such changes these days - such are the difficulties
> for a comparatively low return.)
>

We have a large number of such stupid, fanatic, desperate developers in
the qemu and kvm communities.

> The solution is to tear down such artificial walls of contribution where
> possible. And tearing down the wall between KVM and qemu-kvm seems very much
> possible and the advantages would be numerous.
>
> Unless by "serious developer" you meant: "developer willing to [or forced to]
> waste time and effort on illogically structured technology".
>

By "serious developer" I mean

- someone who is interested in contributing, not in getting their name
into the kernel commits list
- someone who is willing to read the wiki page and find out where the
repository and mailing list for a project is
- someone who will spend enough time on the project so that the time
to clone two repositories will not be a factor in their contributions
- someone who will work on the uncool stuff like fixing bugs and
providing interfaces to other tools

>> [...]
>>
>> Do you really think the echo'n'cat school of usability wants to write a GUI?
>> In linux-2.6.git?
>>
> Then you'll be surprised to hear that it's happening as we speak and the
> commits are there in linux-2.6.git. Both a TUI and GUI is in the works.
>
> Furthermore, the numbers show that half of the usability fixes to tools/perf/
> came not from regular perf contributors but from random kernel developers and
> testers who build the latest kernel and try out perf at the same
> time (it's very easy because you already have it in the kernel repository - no
> separate download, no installation, etc. necessary).
>
> I had literally zero such contributions when (the precursor to) 'perf' was
> still a separate user-space project.
>
> You could have the same effect for Qemu: the latest bits in tools/kvm/ would
> be built by regular kernel testers and developers. The integration benefits
> dont just extend to developers, a unified project is vastly easier to test as
> well.
>
>

Let's wait and see then. If the tools/perf/ experience has really good
results, we can reconsider this at a later date.

--
error compiling committee.c: too many arguments to function

--

Jes Sorensen

unread,
Mar 18, 2010, 6:50:03 AM3/18/10
to
On 03/18/10 10:54, Ingo Molnar wrote:
> * Jes Sorensen<Jes.So...@redhat.com> wrote:
[...]
>>
>> At my previous employer we ended up dropping all Xen efforts exactly because
>> it was like maintaining two separate operating system kernels. The key to
>> KVM is that once you have Linux, you practically have KVM as well.
>
> Yes. Please realize that what is behind it is a strikingly simple argument:
>
> "Once you have a single project to develop and maintain all is much better."

That's a very glorified statement, but it's not reality, sorry. You can do
that with something like perf because it's so small and development of
perf is limited to a very small group of developers.

>> [...] However, there is far more to it than just a couple of ioctls, for
>> example the stack of reverse device-drivers is a pretty significant code
>> base, rewriting that and maintaining it is not a trivial task. It is
>> certainly my belief that the benefit we get from sharing that with QEMU by
>> far outweighs the cost of forking it and keeping our own fork in the kernel
>> tree. In fact it would result in exactly the same problems I mentioned above
>> wrt Xen.
>
> I do not suggest forking Qemu at all, i suggest using the most natural
> development model for the KVM+Qemu shared project: a single repository.

If you are not suggesting to fork QEMU, what are you suggesting then?
You don't seriously expect that the KVM community will be able to
mandate that the QEMU community switch to the Linux kernel repository?
That would be like telling the openssl developers that they should merge
with glibc and start working out of the glibc tree.

What you are suggesting is *only* going to happen if we fork QEMU, there
is zero chance to move the main QEMU repository into the Linux kernel
tree. And trust me, you don't want to have Linus having to deal with
handling patches for tcg or embedded board emulation.

>> With this you have just thrown away all the benefits of having the QEMU
>> repository shared with other developers who will actively fix bugs in
>> components we do care about for KVM.
>
> Not if it's a unified project.

You still haven't explained how you expect to create a unified KVM+QEMU
project, without forking from the existing QEMU.

>>> - encourage kernel-space and user-space KVM developers to work on both
>>> user-space and kernel-space bits as a single unit. It's one project and
>>> a single experience to the user.
>>
>> This is already happening and a total non issue.
>
> My experience as an external observer of the end result contradicts this.

What I have seen you complain about here is the lack of a good end user
GUI for KVM. However that is a different thing. So far no vendor has put
significant effort into it, but that is nothing new in Linux. We have a
great kernel, but our user applications are still lacking. We have 217
CD players for GNOME, but we have no usable calendaring application.

A good GUI for virtualization is a big task, and whoever designs it will
base their design upon their preferences for what's important. A lot of
spare-time developers would clearly care most about a GUI installation
and fancy icons to click on, whereas server users would be much more
interested in automation and remote access to the systems. For a good
example of an incomplete solution, try installing Fedora over a serial
line: you cannot do half the things without launching VNC :( Getting a
comprehensive solution for this that would satisfy the bulk of the users
would be a huge chunk of code in the kernel tree. Imagine the screaming
that would result! How often have we had moaning from x86 users who
wanted to rip out all the non-x86 code to reduce the size of the
tarball?

> Seemingly trivial usability changes to the KVM+Qemu combo are not being done
> often because they involve cross-discipline changes.

Which trivial usability changes?

>>> - [ and probably libvirt should go there too ]
>>
>> Now that would be interesting, next we'll have to include things like libxml
>> in the kernel git tree as well, to make sure libvirt doesn't get out of sync
>> with the version supplied by your distribution vendor.
>
> The way we have gone about this in tools/perf/ is similar to the route picked
> by Git: we only use very lowlevel libraries available everywhere, and we
> provide optional wrappers to the rest.

Did you ever look at what libvirt actually does and what it offers? Or
how about the various libraries used by QEMU to offer things like VNC
support or X support?

Again this works fine for something like perf where the primary
display is text mode.

>> So far your argument would justify pulling all of gdb into the kernel git
>> tree as well, to support the kgdb efforts, or gcc so we can get rid of the
>> gcc version quirks in the kernel header files, e2fsprogs and equivalent for
>> _all_ file systems included in the kernel so we can make sure our fs tools
>> never get out of sync with whats supported in the kernel......
>
> gdb and gcc is clearly extrinsic to the kernel so why would we move them
> there?

gdb should go with kgdb which goes with the kernel to keep it in sync.
If you want to be consistent in your argument, you have to go all the
way.

> I was talking about tools that are closely related to the kernel - where much
> of the development and actual use is in combination with the Linux kernel.

Well the file system tools would obviously have to go into the kernel
then so appropriate binaries can be distributed to match the kernel.

> 90%+ of the Qemu usecases are combined with Linux. (Yes, i know that you can
> run Qemu without KVM, and no, i dont think it matters in the grand scheme of
> things and most investment into Qemu comes from the KVM angle these days. In
> particular it for sure does not justify handicapping future KVM evolution so
> drastically.)

90+%? You've got to be kidding. You clearly have no idea just how much it's
used for running embedded emulators on non-Linux platforms. You should have
seen the noise it made when I added C99 initializers to certain structs,
because it broke builds using very old GCC versions on BeOS. Linux-only?
Not a chance. Try subscribing to qemu-devel and you'll see a list that is
overtaken only by a few lists like lkml in terms of daily traffic.

>> Oh and you completely forgot SeaBIOS. KVM+QEMU rely on SeaBIOS too, so from
>> what you're saying we should pull that into the kernel git repository as
>> well. Never mind the fact that we share SeaBIOS with the coreboot project
>> which is very actively adding features to it that benefit us as well.....
>
> SeaBIOS is in essence firmware, so it could simply be loaded as such.
>
> Just look at the qemu source code - the BIOSes are .bin images in
> qemu/pc-bios/ imported externally in essence.

Ehm no, QEMU now pulls in SeaBIOS to build it. And there are a lot of
changes that require modification in SeaBIOS to match changes to QEMU.

> qemu-kvm branch is not similar to my proposal at all: it made KVM _more_
> fragmented, not more unified. I.e. it was a move in the exact opposite
> direction and i'd expect such a move to fail.
>
> In fact the failure of qemu-kvm supports my point rather explicitly: it
> demonstrates that extra packages and split development are actively harmful.

Ehm it showed what happens when you fork QEMU to modify it primarily for
your own project, i.e. KVM. You are suggesting we fork QEMU for the
benefit of KVM, and it will be exactly the same thing that happens.

I know you state that you are not suggesting we fork it, but as I showed
above, pulling QEMU into the kernel tree, can only happen as a fork.
There is no point pretending otherwise.

> I speak about this as a person who has done successful unifications of split
> codebases and in my judgement this move would be significantly beneficial to
> KVM.
>
> You cannot really validly reject this proposal with "It wont work" as it
> clearly has worked in other, comparable cases. You could only reject this with
> "I have tried it and it didnt work".
>
> Think about it: a clean and hackable user-space component in tools/kvm/. It's
> very tempting :-)

I say this based on my hacking experience, my experience with the
kernel, the QEMU base, SeaBIOS and merging projects. Yes it can be done,
but the cost is much higher than the gain.

Cheers,
Jes

Ingo Molnar

unread,
Mar 18, 2010, 7:00:02 AM3/18/10
to

* Avi Kivity <a...@redhat.com> wrote:

> > The moment any change (be it as trivial as fixing a GUI detail or as
> > complex as a new feature) involves two or more packages, development speed
> > slows down to a crawl - while the complexity of the change might be very
> > low!
>
> Why is that?

It's very simple: because the contribution latencies and overhead compound,
almost inevitably.

If you ever tried to implement a combo GCC+glibc+kernel feature you'll know
...

Even with the best-run projects in existence it takes forever and is very
painful - and here i talk about first hand experience over many years.

> If the maintainers of all packages are cooperative and responsive, then the
> patches will get accepted quickly. If they aren't, development will be
> slow. [...]

I'm afraid practice is different from the rosy ideal you paint there. Even
with assumed 'perfect projects' there's always random differences between
projects, causing doubled (or tripled) overhead that compounds:

- random differences in release schedules

- random differences in contribution guidelines

- random differences in coding style

> [...] It isn't any different from contributing to two unrelated kernel
> subsystems (which are in fact in different repositories until the next merge
> window).

You mention a perfect example: contributing to multiple kernel subsystems. Even
_that_ is very noticeably harder than contributing to a single subsystem - due
to the inevitable bureaucratic overhead, due to different development trees,
due to different merge criteria.

So you are underlining my point (perhaps without intending to): treating
closely related bits of technology as a single project is much better.

Obviously arch/x86/kvm/, virt/ and tools/kvm/ should live in a single
development repository (perhaps micro-differentiated by a few topical
branches), for exactly those reasons you mention.

Just like tools/perf/ and kernel/perf_event.c and arch/*/kernel/perf*.c are
treated as a single project.

[ Note: we actually started from a 'split' design [almost everyone picks that,
because of this false 'kernel space bits must be separate from user space
bits' myth] where the user-space component was a separate code base and
unified it later on as the project progressed.

Trust me, the practical benefits of the unified approach are enormous to
developers and to users alike, and there was no looking back once we made
the switch. ]

Also, i dont really try to 'convince' you here - you made your position very
clear early on and despite many unopposed technical arguments i made, the
positions seem to have hardened and i expect it wont change, no matter what
arguments i bring. It's a pity but hey, i'm just an observer here really -
it's the rest of _your_ life this all impacts.

I just wanted to point out the root cause of KVM's usability problems as i see
it - just like i was pointing out the mortal Xen design deficiencies back when
i was backing KVM strongly, four years ago. Back then everyone was saying that
i was crazy, that we were stuck with Xen forever, and that while KVM was nice
it had no chance.

Just because you got the kernel bits of KVM right a few years ago does not
mean you cannot mess up other design aspects, and sometimes badly so ;-)
Historically i messed up more than half of all first-gut-feeling technical
design decisions i did, so i had to correct the course many, many times.

I hope you are still keeping an open mind about it all and dont think that
because the project was split for 4 years (through no fault of your own, simply out
of necessity) it should be split forever ...

arch/x86 was split for a much longer period than that.

Circumstances have changed. Most Qemu users/contributions are now coming from
the KVM angle, so please simply start thinking about the next level of
evolution.

Thanks,

Ingo

Ingo Molnar

unread,
Mar 18, 2010, 7:10:01 AM3/18/10
to

* Jes Sorensen <Jes.So...@redhat.com> wrote:

> On 03/18/10 10:54, Ingo Molnar wrote:
> >* Jes Sorensen<Jes.So...@redhat.com> wrote:
> [...]
> >>
> >>At my previous employer we ended up dropping all Xen efforts exactly because
> >>it was like maintaining two separate operating system kernels. The key to
> >>KVM is that once you have Linux, you practically have KVM as well.
> >
> >Yes. Please realize that what is behind it is a strikingly simple argument:
> >
> > "Once you have a single project to develop and maintain all is much better."
>
> That's a very glorified statement, but it's not reality, sorry. You can do
> that with something like perf because it's so small and development of perf
> is limited to a very small group of developers.

I was not talking about just perf: i am also talking about the arch/x86/
unification which is 200+ KLOC of highly non-trivial kernel code with hundreds
of contributors and with 8000+ commits in the past two years.

Also, it applies to perf as well: people said exactly that a year ago: 'perf
has it easy to be clean as it is small, once it gets as large as Oprofile
tooling it will be in the same messy situation'.

Today perf has more features than Oprofile, has a larger and more complex code
base, has more contributors, and no, it's not in the same messy situation at
all.

So whatever you think of large, unified projects, you are quite clearly
mistaken. I have carried out and maintained two different kinds of
unifications and the experience was very similar: both developers and users
(and maintainers) are much better off.

Ingo

Ingo Molnar

unread,
Mar 18, 2010, 7:30:02 AM3/18/10
to

* Avi Kivity <a...@redhat.com> wrote:

> On 03/18/2010 11:22 AM, Ingo Molnar wrote:
> >* Avi Kivity<a...@redhat.com> wrote:
> >
> >>> - move a clean (and minimal) version of the Qemu code base to tools/kvm/,
> >>> in the upstream kernel repo, and work on that from that point on.
> >>I'll ignore the repository location which should be immaterial to a serious
> >>developer and concentrate on the 'clean and minimal' aspect, since it has
> >>some merit. [...]
> >
> > To the contrary, experience shows that repository location, and in
> > particular a shared repository for closely related bits is very much
> > material!
> >
> > It matters because when there are two separate projects, even a "serious
> > developer" is finding it double and triple difficult to contribute even
> > trivial changes.
> >
> > It becomes literally a nightmare if you have to touch 3 packages: kernel,
> > a library and an app codebase. It takes _forever_ to get anything useful
> > done.
>
> You can't be serious. I find that the difficulty in contributing a patch
> has mostly to do with writing the patch, and less with figuring out which
> email address to send it to.

My own experience, and that of everyone i've talked to about such topics
(developers and distro people), tells the exact opposite: it's much
harder to contribute features to multiple packages than to a single project.

kernel+library+app features take forever to propagate, there's constant
fear of version friction, productization deadlines are uncertain, and ABI
mess-ups are frequent as well due to disjoint testing. Also, each component has
essential veto power: so if the proposed API or approach is opposed or changed
in a later stage then that affects (sometimes already committed) changes. If
you've ever done it you'll know how tedious it is.

This very thread and recent threads about KVM usability demonstrate the same
complications.

Thanks,

Ingo

Avi Kivity

unread,
Mar 18, 2010, 7:40:02 AM3/18/10
to
On 03/18/2010 12:50 PM, Ingo Molnar wrote:
> * Avi Kivity<a...@redhat.com> wrote:
>
>
>>> The moment any change (be it as trivial as fixing a GUI detail or as
>>> complex as a new feature) involves two or more packages, development speed
>>> slows down to a crawl - while the complexity of the change might be very
>>> low!
>>>
>> Why is that?
>>
> It's very simple: because the contribution latencies and overhead compound,
> almost inevitably.
>

It's not inevitable. If the projects are badly run you'll have high
latencies, but projects don't have to be badly run.

> If you ever tried to implement a combo GCC+glibc+kernel feature you'll know
> ...
>
> Even with the best-run projects in existence it takes forever and is very
> painful - and here i talk about first hand experience over many years.
>

Try sending a patch to qemu-devel@; you may be pleasantly surprised.


>> If the maintainers of all packages are cooperative and responsive, then the
>> patches will get accepted quickly. If they aren't, development will be
>> slow. [...]
>>
> I'm afraid practice is different from the rosy ideal you paint there. Even
> with assumed 'perfect projects' there's always random differences between
> projects, causing doubled (or tripled) overhead that compounds:
>
> - random differences in release schedules
>
> - random differences in contribution guidelines
>
> - random differences in coding style
>

None of these matter for steady contributors.

>> [...] It isn't any different from contributing to two unrelated kernel
>> subsystems (which are in fact in different repositories until the next merge
>> window).
>>
> You mention a perfect example: contributing to multiple kernel subsystems. Even
> _that_ is very noticeably harder than contributing to a single subsystem - due
> to the inevitable bureaucratic overhead, due to different development trees,
> due to different merge criteria.
>
> So you are underlining my point (perhaps without intending to): treating
> closely related bits of technology as a single project is much better.
>
> Obviously arch/x86/kvm/, virt/ and tools/kvm/ should live in a single
> development repository (perhaps micro-differentiated by a few topical
> branches), for exactly those reasons you mention.
>

How is a patch for the qemu GUI eject button and the kvm shadow mmu
related? Should a single maintainer deal with both?


--
error compiling committee.c: too many arguments to function

--

Ingo Molnar

unread,
Mar 18, 2010, 7:40:02 AM3/18/10
to

* Avi Kivity <a...@redhat.com> wrote:

> > Still it's _very_ useful to have a single reference implementation under
> > tools/perf/ where we concentrate the best of the code. That is where we
> > make sure that each new kernel feature is appropriately implemented in
> > user-space as well, that the combination works well together and is
> > releasable to users. That is what keeps us all honest: the latency of
> > features is much lower, and there's no ping-pong of blame going on between
> > the two components in case of bugs or in case of misfeatures.
>
> That would make sense for a truly minimal userspace for kvm: we once had a
> tool called kvmctl which was used to run tests (since folded into qemu). It
> didn't contain a GUI and was unable to run a general purpose guest. It was
> a few hundred lines of code, and indeed patches to kvmctl had a much closer
> correspondence to patches with kvm (though still low, as most kvm patches
> don't modify the ABI).

If it's functional to the extent of at least allowing, say, a serial console
(like the UML binary allows), i'd expect the minimal user-space to quickly
grow out of this minimal state. The rest will be history.

Maybe this is a better, simpler (and much cleaner and less controversial)
approach than moving a 'full' copy of qemu there.

There's certainly no risk: if qemu stays dominant then nothing is lost
[tools/kvm/ can be removed after some time], and if this clean base works out
fine then the useful qemu technologies will move over to it gradually and
without much fuss, and the developers will move with it as well.

If it's just a token effort with near zero utility to begin with it certainly
wont take off.

Once it's there in tools/kvm/ and bootable i'd certainly hack up some quick
xlib based VGA output capability myself - it's not that hard ;-) It would also
allow me to test whether latest-KVM still boots fine in a much simpler way.
(most of my testboxes dont have qemu installed)

So you have one user signed up for that already ;-)

> > Same goes for KVM+Qemu: it would be so much nicer to have a single,
> > well-focused reference implementation under tools/kvm/ and have
> > improvements flow into that code base.
> >
> > That way KVM developers cannot just shrug "well, GUI suckage is a
> > user-space problem" - like the answers i got in the KVM usability thread
> > ...
> >
> > The buck will stop here.
>
> Suppose we copy qemu tomorrow into tools/. All the problems will be copied
> with it. Someone still has to write patches to fix them. Who will it be?

What we saw with tools/perf/ was that pure proximity to actual kernel testers
and kernel developers produces a steady influx of new developers. It didnt
happen overnight, but it happened. A simple:

cd tools/perf/
make -j install

Gets them something to play with. That kind of proximity is very powerful.

The other benefit was that distros can package perf with the kernel package,
so it's updated together with the kernel. This means a very efficient
distribution of new technologies, together with new kernel releases.

Distributions are very eager to update kernels even in stable periods of the
distro lifetime - they are much less willing to update user-space packages.

You can literally get full KVM+userspace features done _and deployed to users_
within the 3 months development cycle of upstream KVM.

All these create synergies that are very clear once you see the process in
motion. It's a powerful positive feedback loop. Give it some thought please.

Ingo

Ingo Molnar

unread,
Mar 18, 2010, 7:50:03 AM3/18/10
to

* Avi Kivity <a...@redhat.com> wrote:

> On 03/18/2010 12:50 PM, Ingo Molnar wrote:
> >* Avi Kivity<a...@redhat.com> wrote:
> >
> >>>The moment any change (be it as trivial as fixing a GUI detail or as
> >>>complex as a new feature) involves two or more packages, development speed
> >>>slows down to a crawl - while the complexity of the change might be very
> >>>low!
> >>Why is that?
> > It's very simple: because the contribution latencies and overhead compound,
> > almost inevitably.
>
> It's not inevitable. If the projects are badly run you'll have high
> latencies, but projects don't have to be badly run.

So the 64K dollar question is, why does Qemu still suck?

We have co-maintainers for perf that have a different focus. It works pretty
well.

Look at git log tools/perf/ and how user-space and kernel-space components
interact in practice. You'll see patches that only impact one side, but you'll also see
very big overlap both in contributor identity and in patches as well.

Also, let me put similar questions in a bit different way:

- ' how is an in-kernel PIT emulation connected to Qemu's PIT emulation? '

- ' how is the in-kernel dynticks implementation related to Qemu's
implementation of hardware timers? '

- ' how is an in-kernel event for a CD-ROM eject connected to an in-Qemu
eject event? '

- ' how is a new hardware virtualization feature related to being able to
configure and use it via Qemu? '

- ' how is the in-kernel x86 decoder/emulator related to the Qemu x86
emulator? '

- ' how is the performance of the qemu GUI related to the way VGA buffers are
mapped and accelerated by KVM? '

They are obviously deeply related. The quality of a development process is not
defined by the easy cases where no project unification is needed. The quality
of a development process is defined by the _difficult_ cases.

Ingo

Alexander Graf

unread,
Mar 18, 2010, 8:10:03 AM3/18/10
to

Alright, you just volunteered. Just give it a go and try to implement
the "oh so simple" KVM frontend while maintaining compatibility with at
least a few older Linux guests. My guess is that you'll realize it's a
dead end before committing anything to the kernel source tree. But
really, just try it out.


Good Luck

Alex

Avi Kivity

unread,
Mar 18, 2010, 8:30:02 AM3/18/10
to
On 03/18/2010 01:48 PM, Ingo Molnar wrote:
>
>> It's not inevitable. If the projects are badly run you'll have high
>> latencies, but projects don't have to be badly run.
>>
> So the 64K dollar question is, why does Qemu still suck?
>

Where people sent patches, it doesn't suck (or sucks less). Where they
don't, it still sucks. And it cost way more than $64K.

If moving things to tools/ helps, let's move Fedora to tools/.

>> How is a patch for the qemu GUI eject button and the kvm shadow mmu related?
>> Should a single maintainer deal with both?
>>
> We have co-maintainers for perf that have a different focus. It works pretty
> well.
>

And it works well when I have patches that change x86 core and kvm. But
that's no longer a single repository and we have to coordinate.

> Look at git log tools/perf/ and how user-space and kernel-space components
> interact in practice. You'll see patches that only impact one side, but you'll also see
> very big overlap both in contributor identity and in patches as well.
>
> Also, let me put similar questions in a bit different way:
>
> - ' how is an in-kernel PIT emulation connected to Qemu's PIT emulation? '
>

Both implement the same spec. One is a code derivative of the other
(via Xen).

> - ' how is the in-kernel dynticks implementation related to Qemu's
> implementation of hardware timers? '
>

The quality of host kernel timers directly determines the quality of
qemu's timer emulation.

> - ' how is an in-kernel event for a CD-ROM eject connected to an in-Qemu
> eject event? '
>

Both implement the same spec. The kernel of course needs to handle all
implementation variants, while qemu only needs to implement it once.

> - ' how is a new hardware virtualization feature related to being able to
> configure and use it via Qemu? '
>

Most features (example: npt) are transparent to userspace, some are
not. When they are not, we introduce an ioctl() to kvm for controlling
the feature, and a command-line switch to qemu for calling it.
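
(For concreteness, a hedged sketch of the probe half of that pattern:
KVM_CHECK_EXTENSION is the real probing ioctl; KVM_CAP_IRQCHIP is just one
example capability, and a qemu command-line switch would then decide
whether the matching control ioctl gets called.)

  #include <fcntl.h>
  #include <linux/kvm.h>
  #include <stdio.h>
  #include <sys/ioctl.h>

  int main(void)
  {
      int kvm = open("/dev/kvm", O_RDWR);

      /* returns a positive value if the kernel side supports the
       * capability, 0 if it does not */
      int ret = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_IRQCHIP);
      printf("in-kernel irqchip: %s\n", ret > 0 ? "yes" : "no");
      return 0;
  }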

> - ' how is the in-kernel x86 decoder/emulator related to the Qemu x86
> emulator? '
>

Both implement the same spec. Note qemu is not an emulator but a binary
translator.

> - ' how is the performance of the qemu GUI related to the way VGA buffers are
> mapped and accelerated by KVM? '
>

kvm needs to support direct mapping when possible and efficient data
transfer when not. The latter will obviously be much slower. When
direct mapping is possible, kvm needs to track pages touched by the
guest to avoid full screen redraws. The rest (interfacing to X or vnc,
implementing emulated hardware acceleration, full-screen mode, etc.) are
unrelated.

> They are obviously deeply related.

Not at all. kvm in fact knows nothing about vga, to take your last
example. To suggest that qemu needs to be close to the kernel to
benefit from the kernel's timer implementation means we don't care about
providing quality timing except to ourselves, which luckily isn't the case.

Some time ago the various desktops needed directory change notification,
and people implemented inotify (or whatever it's called today). No one
suggested tools/gnome/ and tools/kde/.

> The quality of a development process is not
> defined by the easy cases where no project unification is needed. The quality
> of a development process is defined by the _difficult_ cases.
>

That's true, but we don't have issues at the qemu/kvm boundary. Note we
do have issues at the qemu/aio interfaces and qemu/net interfaces (out
of which vhost-net was born) but these wouldn't be solved by tools/qemu/.

--
error compiling committee.c: too many arguments to function

--

Frank Ch. Eigler

unread,
Mar 18, 2010, 8:40:03 AM3/18/10
to
Ingo Molnar <mi...@elte.hu> writes:

> [...]


> Distributions are very eager to update kernels even in stable periods of the
> distro lifetime - they are much less willing to update user-space packages.

> [...]

Sorry, er, what? What distributions eagerly upgrade kernels in stable
periods, were it not primarily motivated by security fixes? What users
eagerly replace their kernels?

- FChE

Arnaldo Carvalho de Melo

unread,
Mar 18, 2010, 9:10:03 AM3/18/10
to
On Thu, Mar 18, 2010 at 09:03:25AM +0100, Ingo Molnar wrote:
>
> * Zhang, Yanmin <yanmin...@linux.intel.com> wrote:
>
> > I worked out 3 new patches against tip/master tree of Mar. 17th.
>
> Cool! Mind sending them as a series of patches instead of attachments? That
> makes it easier to review them. Also, the Signed-off-by lines seem to be
> missing, plus we need a per-patch changelog as well.

Yeah, please, and I hadn't merged them, so the resend was the best thing to do.

- Arnaldo

John Kacur

unread,
Mar 18, 2010, 9:10:02 AM3/18/10
to
On Thu, Mar 18, 2010 at 1:33 PM, Frank Ch. Eigler <fc...@redhat.com> wrote:
> Ingo Molnar <mi...@elte.hu> writes:
>
>> [...]
>> Distributions are very eager to update kernels even in stable periods of the
>> distro lifetime - they are much less willing to update user-space packages.
>> [...]
>
> Sorry, er, what?  What distributions eagerly upgrade kernels in stable
> periods, were it not primarily motivated by security fixes?  What users
> eagerly replace their kernels?
>

Us guys reading and participating on the list. ;)

Ingo Molnar

unread,
Mar 18, 2010, 9:10:02 AM3/18/10
to

* Frank Ch. Eigler <fc...@redhat.com> wrote:

> Ingo Molnar <mi...@elte.hu> writes:
>
> > [...]
> > Distributions are very eager to update kernels even in stable periods of the
> > distro lifetime - they are much less willing to update user-space packages.
> > [...]
>
> Sorry, er, what? What distributions eagerly upgrade kernels in stable
> periods, were it not primarily motivated by security fixes? [...]

Please check the popular distro called 'Fedora' for example, and its kernel
upgrade policies.

> [...] What users eagerly replace their kernels?

Those 99% who click on the 'install 193 updates' popup.

Ingo

Ingo Molnar

unread,
Mar 18, 2010, 9:10:02 AM3/18/10
to

* Avi Kivity <a...@redhat.com> wrote:

> On 03/18/2010 01:48 PM, Ingo Molnar wrote:
>
> > > It's not inevitable. If the projects are badly run you'll have high
> > > latencies, but projects don't have to be badly run.
> >
> > So the 64K dollar question is, why does Qemu still suck?
>
> Where people sent patches, it doesn't suck (or sucks less). Where they
> don't, it still sucks. [...]

So is your point that the development process and basic code structure do
not matter at all, and it's just a matter of people sending patches? I beg to
differ ...

> [...] And it cost way more than $64K.


>
> If moving things to tools/ helps, let's move Fedora to tools/.

Those bits of Fedora which deeply relate to the kernel - yes.
Those bits that are arguably separate - nope.

> >> How is a patch for the qemu GUI eject button and the kvm shadow mmu
> >> related? Should a single maintainer deal with both?
> >
> > We have co-maintainers for perf that have a different focus. It works
> > pretty well.
>
> And it works well when I have patches that change x86 core and kvm. But
> that's no longer a single repository and we have to coordinate.

Actually, it works much better if, contrary to your proposal, it ends up in a
single repo. Last i checked both of us really worked on such a project, run by
some guy. (Named Linus or so.)

> Not at all. [...]

You are obviously arguing for something like UML. Fortunately KVM is not that.
Or i hope it isnt.

> [...] kvm in fact knows nothing about vga, to take your last
> example. [...]

Look at the VGA dirty bitmap optimization, a.k.a. the KVM_GET_DIRTY_LOG ioctl.

See qemu/kvm-all.c's kvm_physical_sync_dirty_bitmap().

It started out as a VGA optimization (also used by live migration) and even
today it's mostly used by the VGA drivers - albeit a weak one.

I wish there were stronger VGA optimizations implemented; copying the dirty
bitmap is not a particularly performant solution (although it's certainly
better than full emulation). Graphics performance is one of the more painful
aspects of KVM usability today.
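
[ A hedged sketch of the user-space half of that interface - the ioctl and
  the struct are the real KVM ABI; the slot/bitmap bookkeeping around them
  is simplified here: ]

  #include <linux/kvm.h>
  #include <string.h>
  #include <sys/ioctl.h>

  /* ask KVM which pages of a memory slot (e.g. the one backing VGA
   * memory) were dirtied since the last call; KVM fills in one bit
   * per page and clears its internal copy */
  static int sync_dirty_bitmap(int vm_fd, int slot, void *bitmap)
  {
      struct kvm_dirty_log log;

      memset(&log, 0, sizeof(log));
      log.slot = slot;
      log.dirty_bitmap = bitmap;

      /* the caller scans the returned bitmap and redraws (or
       * migrates) only the pages that were touched */
      return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
  }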

> [...] To suggest that qemu needs to be close to the kernel to benefit from
> the kernel's timer implementation means we don't care about providing
> quality timing except to ourselves, which luckily isn't the case.

That is not what i said. I said they are closely related, and where
technologies are closely related, project proximity turns into project
unification at a certain stage.

> Some time ago the various desktops needed directory change
> notification, and people implemented inotify (or whatever it's
> called today). No one suggested tools/gnome/ and tools/kde/.

You are misconstruing and misrepresenting my argument - i'd expect better.
Gnome and KDE run on other kernels as well and are generally not considered
close to the kernel.

Do you seriously argue that Qemu has nothing to do with KVM these days?

> > The quality of a development process is not defined by the easy cases
> > where no project unification is needed. The quality of a development
> > process is defined by the _difficult_ cases.
>
> That's true, but we don't have issues at the qemu/kvm boundary. Note we do
> have issues at the qemu/aio interfaces and qemu/net interfaces (out of which
> vhost-net was born) but these wouldn't be solved by tools/qemu/.

That was not what i suggested. They would be solved by what i proposed:
tools/kvm/, right?

Thanks,

Ingo

Avi Kivity

unread,
Mar 18, 2010, 9:20:03 AM3/18/10
to
On 03/18/2010 03:02 PM, Ingo Molnar wrote:
>
>> [...] What users eagerly replace their kernels?
>>
> Those 99% who click on the 'install 193 updates' popup.
>
>

Of which 1 is the kernel, and 192 are userspace updates (of which one
may be qemu).

--
error compiling committee.c: too many arguments to function

--

Jes Sorensen

unread,
Mar 18, 2010, 9:30:02 AM3/18/10
to
On 03/18/10 11:58, Ingo Molnar wrote:
>
> * Jes Sorensen<Jes.So...@redhat.com> wrote:
>> That's a very glorified statement, but it's not reality, sorry. You can do
>> that with something like perf because it's so small and development of perf
>> is limited to a very small group of developers.
>
> I was not talking about just perf: i am also talking about the arch/x86/
> unification which is 200+ KLOC of highly non-trivial kernel code with hundreds
> of contributors and with 8000+ commits in the past two years.

Sorry, but you cannot compare merging two chunks of kernel code that
originated from the same base with the effort of mixing a large
userland project with a kernel component. Apples and oranges.

> Also, it applies to perf as well: people said exactly that a year ago: 'perf
> has it easy to be clean as it is small, once it gets as large as Oprofile
> tooling it will be in the same messy situation'.
>
> Today perf has more features than Oprofile, has a larger and more complex code
> base, has more contributors, and no, it's not in the same messy situation at
> all.

Both perf and oprofile are still relatively small projects in comparison
to QEMU.

> So whatever you think of large, unified projects, you are quite clearly
> mistaken. I have done and maintained through two different types of
> unifications and the experience was very similar: both developers and users
> (and maintainers) are much better off.

You believe that I am wrong in my assessment of unified projects, and I
obviously think you are mistaken and underestimating the cost and
effects of trying to merge the two.

Well I think we are just going to agree to disagree on this one. I am
not against merging projects where it makes sense, but in this
particular case I am strongly convinced the loss would be much greater
than the gain.

Cheers,
Jes

Frank Ch. Eigler

unread,
Mar 18, 2010, 9:30:02 AM3/18/10
to
Hi -

> > > [...]
> > > Distributions are very eager to update kernels even in stable periods of the
> > > distro lifetime - they are much less willing to update user-space packages.
> > > [...]
> >
> > Sorry, er, what? What distributions eagerly upgrade kernels in stable
> > periods, were it not primarily motivated by security fixes? [...]
>
> Please check the popular distro called 'Fedora' for example

I do believe I've heard of it. According to fedora bodhi, there have
been 18 kernel updates issued for fedora 11 since its release, of
which 12 were for purely security updates, and most of the other six
also contain security fixes. None are described as 'enhancement'
updates. Oh, what about fedora 12? 8 updates total, of which 5 are
security only, one for drm showstoppers, others including security
fixes, again 0 tagged as 'enhancement'.

So where is that "eagerness" again?? My sense is that most users are
happy to leave a stable kernel running as long as possible, and
distributions know this. You surely must understand that the lkml
demographics are different.


> and its kernel upgrade policies.

[citation needed]


> > [...] What users eagerly replace their kernels?
>
> Those 99% who click on the 'install 193 updates' popup.

That's not "eager". That's "I'm exasperated from guessing what's
really important; let's not have so many updates; meh".


- FChE

Ingo Molnar

unread,
Mar 18, 2010, 9:40:03 AM3/18/10
to

* Avi Kivity <a...@redhat.com> wrote:

> On 03/18/2010 03:02 PM, Ingo Molnar wrote:
> >
> >> [...] What users eagerly replace their kernels?
> >
> > Those 99% who click on the 'install 193 updates' popup.
> >
>
> Of which 1 is the kernel, and 192 are userspace updates (of which one may be
> qemu).

I think you didnt understand my (tersely explained) point - which is probably
my fault. What i said is:

- distros update the kernel first. Often in stable releases as well if
there's a new kernel released. (They must because it provides new hardware
enablement and other critical changes they generally cannot skip.)

- Qemu on the other hand is not upgraded with (nearly) that level of urgency.
Completely new versions will generally have to wait for the next distro
release.

With in-kernel tools the kernel and the tooling that accompanies the kernel
are upgraded in the same low-latency pathway. That is a big plus if you are
offering things like instrumentation (which perf does), which relates closely
to the kernel.

Furthermore, many distros package up the latest -git kernel as well. They
almost never do that with user-space packages.

Let me give you a specific example:

I'm running Fedora Rawhide with 2.6.34-rc1 right now on my main desktop, and
that comes with perf-2.6.34-0.10.rc1.git0.fc14.noarch.

My rawhide box has qemu-kvm-0.12.3-3.fc14.x86_64 installed. That's more than
1000 Qemu commits older than the latest Qemu development branch.

So by being part of the kernel repo there's lower latency upgrades and earlier
and better testing available on most distros.

You made it very clear that you dont want that, but please dont try to claim
that those advantages do not exist - they are very much real and we are making
good use of it.

Thanks,

Ingo

Avi Kivity

unread,
Mar 18, 2010, 9:40:02 AM3/18/10
to
On 03/18/2010 03:00 PM, Ingo Molnar wrote:
> * Avi Kivity<a...@redhat.com> wrote:
>
>
>> On 03/18/2010 01:48 PM, Ingo Molnar wrote:
>>
>>
>>>> It's not inevitable. If the projects are badly run you'll have high
>>>> latencies, but projects don't have to be badly run.
>>>>
>>> So the 64K dollar question is, why does Qemu still suck?
>>>
>> Where people sent patches, it doesn't suck (or sucks less). Where they
>> don't, it still sucks. [...]
>>
> So is your point that the development process and basic code structure do
> not matter at all, and it's just a matter of people sending patches? I beg to
> differ ...
>

The development process of course matters, and we have worked hard to
fix qemu's. Basic code structure also matters, but you don't fix that
with cp.

>> [...] And it cost way more than $64K.
>>
>> If moving things to tools/ helps, let's move Fedora to tools/.
>>
> Those bits of Fedora which deeply relate to the kernel - yes.
> Those bits that are arguably separate - nope.
>

A qemu GUI is not deeply related to the kernel. Or at all.

>>>> How is a patch for the qemu GUI eject button and the kvm shadow mmu
>>>> related? Should a single maintainer deal with both?
>>>>
>>> We have co-maintainers for perf that have a different focus. It works
>>> pretty well.
>>>
>> And it works well when I have patches that change x86 core and kvm. But
>> that's no longer a single repository and we have to coordinate.
>>
> Actually, it works much better if, contrary to your proposal, it ends up in a
> single repo. Last i checked both of us really worked on such a project, run by
> some guy. (Named Linus or so.)
>

Well, the last time I sent x86 patches, they went to you and hpa, were
applied to tip, from which I had to merge them back. Two repositories. After
several weeks they did end up in a third repository, Linus'. The
process isn't trivial or fast, but it works.

I am not arguing for UML and don't understand why you think so.

>> [...] kvm in fact knows nothing about vga, to take your last
>> example. [...]
>>
> Look at the VGA dirty bitmap optimization, a.k.a. the KVM_GET_DIRTY_LOG ioctl.
>
> See qemu/kvm-all.c's kvm_physical_sync_dirty_bitmap().
>
> It started out as a VGA optimization (also used by live migration) and even
> today it's mostly used by the VGA drivers - albeit a weak one.
>
> I wish there were stronger VGA optimizations implemented; copying the dirty
> bitmap is not a particularly performant solution.

The VGA dirty bitmap is 256 bytes in length. Copying it doesn't take
any time at all.
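
(Back-of-the-envelope, assuming 4 KiB pages: 256 bytes = 2048 bits, one bit
per page, i.e. enough to track an 8 MiB VGA memory slot.)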

People are in fact working on a copy-less dirty bitmap solution, for
live migration of very large memory guests. Expect set_bit_user()
patches for tip.git.

> (although it's certainly
> better than full emulation). Graphics performance is one of the more painful
> aspects of KVM usability today.
>

If you have suggestions for further optimizations (or even patches) I'd
love to hear them.

One solution we are working on is QXL, a framebuffer-less graphics card
designed for spice. The use case is again server based (hosted
desktops) but may be adapted for desktop-on-desktop use.

>> [...] To suggest that qemu needs to be close to the kernel to benefit from
>> the kernel's timer implementation means we don't care about providing
>> quality timing except to ourselves, which luckily isn't the case.
>>
> That is not what i said. I said they are closely related, and where
> technologies are closely related, project proximity turns into project
> unification at a certain stage.
>

I really don't see how. So what if both qemu and kvm implement an
i8254? They can't share any code since the internal APIs are so
different. Even worse for the x86 emulator, as qemu's and kvm's are
fundamentally different. Even more so with the qemu timers and the kernel
dyntick code.

>> Some time ago the various desktops needed directory change
>> notification, and people implemented inotify (or whatever it's
>> called today). No one suggested tools/gnome/ and tools/kde/.
>>
> You are misconstruing and misrepresenting my argument - i'd expect better.
> Gnome and KDE runs on other kernels as well and is generally not considered
> close to the kernel.
>

qemu runs on other kernels (including Windows), just without kvm.

> Do you seriously argue that Qemu has nothing to do with KVM these days?
>

The vast majority of qemu has nothing to do with kvm; all the kvm
interface bits are in two files. Things like the GUI, the VNC server,
IDE emulation, the management interface (the monitor), live migration,
qcow2 and ~15 other file format drivers, chipset emulation, USB
controller emulation, snapshot support, slirp, serial port emulation,
and a zillion other details have nothing to do with kvm.

>>> The quality of a development process is not defined by the easy cases
>>> where no project unification is needed. The quality of a development
>>> process is defined by the _difficult_ cases.
>>>
>> That's true, but we don't have issues at the qemu/kvm boundary. Note we do
>> have issues at the qemu/aio interfaces and qemu/net interfaces (out of which
>> vhost-net was born) but these wouldn't be solved by tools/qemu/.
>>
> That was not what i suggested. They would be solved by what i proposed:
> tools/kvm/, right?
>

If they were, it would be worth it.

--
error compiling committee.c: too many arguments to function

--

Avi Kivity

unread,
Mar 18, 2010, 9:50:02 AM3/18/10
to
On 03/18/2010 03:31 PM, Ingo Molnar wrote:
> * Avi Kivity<a...@redhat.com> wrote:
>
>
>> On 03/18/2010 03:02 PM, Ingo Molnar wrote:
>>
>>>
>>>> [...] What users eagerly replace their kernels?
>>>>
>>> Those 99% who click on the 'install 193 updates' popup.
>>>
>>>
>> Of which 1 is the kernel, and 192 are userspace updates (of which one may be
>> qemu).
>>
> I think you didnt understand my (tersely explained) point - which is probably
> my fault. What i said is:
>
> - distros update the kernel first. Often in stable releases as well if
> there's a new kernel released. (They must because it provides new hardware
> enablement and other critical changes they generally cannot skip.)
>

No, they don't. RHEL 5 is still on 2.6.18, for example. Users don't
like their kernels updated unless absolutely necessary, with good reason.

Kernel updates = reboots.

> - Qemu on the other hand is not upgraded with (nearly) that level of urgency.
> Completely new versions will generally have to wait for the next distro
> release.
>

F12 recently updated to 2.6.32. This is probably due to 2.6.31.stable
dropping away, and no capacity at Fedora to maintain it on their own.
So they are caught in a bind - stay on 2.6.31 and expose users to
security vulnerabilities or move to 2.6.32 and cause regressions. Not a
happy choice.

> With in-kernel tools the kernel and the tooling that accompanies the kernel
> are upgraded in the same low-latency pathway. That is a big plus if you are
> offering things like instrumentation (which perf does), which relates closely
> to the kernel.
>
> Furthermore, many distros package up the latest -git kernel as well. They
> almost never do that with user-space packages.
>

I'm sure if we ask the Fedora qemu maintainer to package qemu-kvm.git
they'll consider it favourably. Isn't that what rawhide is for?

> Let me give you a specific example:
>
> I'm running Fedora Rawhide with 2.6.34-rc1 right now on my main desktop, and
> that comes with perf-2.6.34-0.10.rc1.git0.fc14.noarch.
>
> My rawhide box has qemu-kvm-0.12.3-3.fc14.x86_64 installed. That's more than
> 1000 Qemu commits older than the latest Qemu development branch.
>
> So by being part of the kernel repo there's lower latency upgrades and earlier
> and better testing available on most distros.
>
> You made it very clear that you dont want that, but please dont try to claim
> that those advantages do not exist - they are very much real and we are making
> good use of it.
>

I don't mind at all if rawhide users run on the latest and greatest, but
release users deserve a little more stability.

--
error compiling committee.c: too many arguments to function

--

Ingo Molnar

unread,
Mar 18, 2010, 9:50:02 AM3/18/10
to

* Frank Ch. Eigler <fc...@redhat.com> wrote:

> Hi -
>
> > > > [...]
> > > > Distributions are very eager to update kernels even in stable periods of the
> > > > distro lifetime - they are much less willing to update user-space packages.
> > > > [...]
> > >
> > > Sorry, er, what? What distributions eagerly upgrade kernels in stable
> > > periods, were it not primarily motivated by security fixes? [...]
> >
> > Please check the popular distro called 'Fedora' for example
>
> I do believe I've heard of it. According to fedora bodhi, there have
> been 18 kernel updates issues for fedora 11 since its release, of
> which 12 were for purely security updates, and most of the other six
> also contain security fixes. None are described as 'enhancement'
> updates. Oh, what about fedora 12? 8 updates total, of which 5 are
> security only, one for drm showstoppers, others including security
> fixes, again 0 tagged as 'enhancement'.
>
> So where is that "eagerness" again?? My sense is that most users are
> happy to leave a stable kernel running as long as possible, and
> distributions know this. You surely must understand that the lkml
> demographics are different.
>
> > and its kernel upgrade policies.
>
> [citation needed]

You are quite wrong, despite the sarcastic tone you are attempting to use, and
this is distro kernel policy 101.

For distros such as Fedora it's simpler to support the same kernel version
across many older versions of the distro than having to support different
kernel versions.

Check Fedora 12 for example. Four months ago it was released with kernel
v2.6.31:

http://download.fedora.redhat.com/pub/fedora/linux/releases/12/Fedora/x86_64/os/Packages/kernel-2.6.31.5-127.fc12.x86_64.rpm

But if you update a Fedora 12 installation today you'll get kernel v2.6.32:

http://download.fedora.redhat.com/pub/fedora/linux/updates/12/SRPMS/kernel-2.6.32.9-70.fc12.src.rpm

As a result you'll get a new 2.6.32 kernel on Fedora 12.

The end result is what i said in the previous mail: that you'll get a newer
kernel even on a stable distro - while user-space packages will only be
updated if there's a security issue (and even then there's no version jump
like for the kernel).

> > > [...] What users eagerly replace their kernels?
> >
> > Those 99% who click on the 'install 193 updates' popup.
>
> That's not "eager". That's "I'm exasperated from guessing what's really
> important; let's not have so many updates; meh".

Erm, fact is, 99% [WAG] of the users click on the update button and accept
whatever kernel version the distro update offers them.

Ingo

Daniel P. Berrange

unread,
Mar 18, 2010, 9:50:03 AM3/18/10
to
On Thu, Mar 18, 2010 at 02:31:24PM +0100, Ingo Molnar wrote:
>
> * Avi Kivity <a...@redhat.com> wrote:
>
> > On 03/18/2010 03:02 PM, Ingo Molnar wrote:
> > >
> > >> [...] What users eagerly replace their kernels?
> > >
> > > Those 99% who click on the 'install 193 updates' popup.
> > >
> >
> > Of which 1 is the kernel, and 192 are userspace updates (of which one may be
> > qemu).
>
> I think you didnt understand my (tersely explained) point - which is probably
> my fault. What i said is:
>
> - distros update the kernel first. Often in stable releases as well if
> there's a new kernel released. (They must because it provides new hardware
> enablement and other critical changes they generally cannot skip.)
>
> - Qemu on the other hand is not upgraded with (nearly) that level of urgency.
> Completely new versions will generally have to wait for the next distro
> release.

This has nothing to do with them being in separate source repos. We could
update QEMU to new major feature releases with the same frequency in a Fedora
release, but we deliberately choose not to rebase the QEMU userspace because
experience has shown the downside from new bugs / regressions outweighs the
benefit of any new features.

The QEMU updates in stable Fedora trees now just follow the minor bugfix
release stream provided by QEMU, and those arrive in Fedora with little
noticeable delay.

Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Ingo Molnar

unread,
Mar 18, 2010, 10:00:02 AM3/18/10
to

* Avi Kivity <a...@redhat.com> wrote:

> On 03/18/2010 03:31 PM, Ingo Molnar wrote:
> >* Avi Kivity<a...@redhat.com> wrote:
> >
> >>On 03/18/2010 03:02 PM, Ingo Molnar wrote:
> >>>>[...] What users eagerly replace their kernels?
> >>>Those 99% who click on the 'install 193 updates' popup.
> >>>
> >>Of which 1 is the kernel, and 192 are userspace updates (of which one may be
> >>qemu).
> >I think you didnt understand my (tersely explained) point - which is probably
> >my fault. What i said is:
> >
> > - distros update the kernel first. Often in stable releases as well if
> > there's a new kernel released. (They must because it provides new hardware
> > enablement and other critical changes they generally cannot skip.)
>

> No, they don't. [...]

I just replied to Frank Ch. Eigler with a specific example that shows how this
happens - and believe me, it happens.

> [...] RHEL 5 is still on 2.6.18, for example. Users
> don't like their kernels updated unless absolutely necessary, with
> good reason.

Nope - RHEL 5 is on a 2.6.18 base for entirely different reasons.

> Kernel updates = reboots.

If you check the update frequency of RHEL 5 kernels you'll see that it's
comparable to that of Fedora.

> > - Qemu on the other hand is not upgraded with (nearly) that level of urgency.
> > Completely new versions will generally have to wait for the next distro
> > release.
>
> F12 recently updated to 2.6.32. This is probably due to 2.6.31.stable
> dropping away, and no capacity at Fedora to maintain it on their own. So
> they are caught in a bind - stay on 2.6.31 and expose users to security
> vulnerabilities or move to 2.6.32 and cause regressions. Not a happy
> choice.

Happy choice or not, this is what I said distro practice is these days. (I
don't know all the distros that well, so I'm sure there are differences.)

> > With in-kernel tools the kernel and the tooling that accompanies the kernel
> > are upgraded in the same low-latency pathway. That is a big plus if you are
> > offering things like instrumentation (which perf does), which relates closely
> > to the kernel.
> >
> > Furthermore, many distros package up the latest -git kernel as well. They
> > almost never do that with user-space packages.
>
> I'm sure if we ask the Fedora qemu maintainer to package qemu-kvm.git
> they'll consider it favourably. Isn't that what rawhide is for?

Rawhide is generally for the latest released versions, to ready them for the
next distro release - with a special exception for the kernel, which has a
special position due to being a hardware enabler and because it has an
extremely predictable release schedule of every 90 days (+- 10 days).

Distro people very rarely jump versions for things like GCC or Xorg or
Gnome/KDE: they've been burned enough times by unexpected delays in those
projects to be really loath to do it.

Qemu might get an exception - dunno, you could ask. My point still holds: by
hosting KVM user-space bits in the kernel together with the rest of KVM you
get version parity - which has clear advantages.
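
(Illustrative aside: the parity is directly visible on a running system,
since perf reports the kernel version of the tree it was built from:

    uname -r          # e.g. 2.6.34-rc1
    perf --version    # e.g. perf version 2.6.34-rc1

When both come from the same tree, the two move in lockstep.)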

You also might have more luck with a bleeding-edge distro such as Gentoo.

> >Let me give you a specific example:
> >
> >I'm running Fedora Rawhide with 2.6.34-rc1 right now on my main desktop, and
> >that comes with perf-2.6.34-0.10.rc1.git0.fc14.noarch.
> >
> >My rawhide box has qemu-kvm-0.12.3-3.fc14.x86_64 installed. That's more than
> >1000 Qemu commits older than the latest Qemu development branch.
> >
> >So by being part of the kernel repo there's lower latency upgrades and earlier
> >and better testing available on most distros.
> >
> >You made it very clear that you don't want that, but please don't try to claim
> >that those advantages do not exist - they are very much real and we are making
> >good use of it.
>
> I don't mind at all if rawhide users run on the latest and greatest, but
> release users deserve a little more stability.

What are you suggesting - that released versions of KVM are not reliable? Of
course any tools/ bits are release-engineered just as much as the rest of KVM
...

Ingo

John Kacur

unread,
Mar 18, 2010, 10:10:03 AM3/18/10
to
On Thu, Mar 18, 2010 at 2:59 PM, Ingo Molnar <mi...@elte.hu> wrote:

>
> * Daniel P. Berrange <berr...@redhat.com> wrote:
>
>> On Thu, Mar 18, 2010 at 02:31:24PM +0100, Ingo Molnar wrote:
>> >
>> > * Avi Kivity <a...@redhat.com> wrote:
>> >
>> > > On 03/18/2010 03:02 PM, Ingo Molnar wrote:
>> > > >
>> > > >> [...] What users eagerly replace their kernels?
>> > > >
>> > > > Those 99% who click on the 'install 193 updates' popup.
>> > > >
>> > >
>> > > Of which 1 is the kernel, and 192 are userspace updates (of which one may be
>> > > qemu).
>> >
>> > I think you didn't understand my (tersely explained) point - which is probably
>> > my fault. What I said is:
>> >
>> >  - distros update the kernel first. Often in stable releases as well if
>> >    there's a new kernel released. (They must because it provides new hardware
>> >    enablement and other critical changes they generally cannot skip.)
>> >
>> >  - Qemu on the other hand is not upgraded with (nearly) that level of urgency.
>> >    Completely new versions will generally have to wait for the next distro
>> >    release.
>>
>> This has nothing to do with them being in separate source repos. We could
>> update QEMU to new major feature releases with the same frequency in a Fedora
>> release, but we deliberately choose not to rebase the QEMU userspace because
>> experience has shown that the downside from new bugs / regressions outweighs
>> the benefit of any new features.
>>
>> The QEMU updates in stable Fedora trees now just follow the minor bugfix
>> release stream provided by QEMU, and those arrive in Fedora with little
>> noticeable delay.
>
> That is exactly what I said: Qemu and most user-space packages are on a
> 'slower' update track than the kernel: generally updated for minor releases.
>
> My further point was that the kernel on the other hand gets updated more
> frequently and as such, any user-space tool bits hosted in the kernel repo get
> updated more frequently as well.
>
> Thanks,
>
>        Ingo

Just to play devil's advocate, let's not mix up the development model with the
distribution model. There is nothing to stop packagers and distributors from
providing separate kernel "proper" packages and perf tools packages.

It might even make good sense, assuming backwards compatibility, for distros
that have conservative policies about new kernel versions to provide newer
perf tools packages with older kernels.

John
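
(Illustrative aside on John's point: perf lives in tools/perf inside the
kernel source tree and can be built standalone, so a newer perf can be
packaged for - or simply built on - a box running an older kernel. A
minimal sketch, assuming an unpacked kernel tree at linux-2.6/:

    cd linux-2.6/tools/perf
    make                        # builds the perf binary in place
    ./perf record -a sleep 5    # system-wide samples for 5 seconds
    ./perf report

The path and the backwards compatibility are assumptions, as John notes.)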

Ingo Molnar

unread,
Mar 18, 2010, 10:10:02 AM3/18/10
to

* Daniel P. Berrange <berr...@redhat.com> wrote:

> On Thu, Mar 18, 2010 at 02:31:24PM +0100, Ingo Molnar wrote:
> >
> > * Avi Kivity <a...@redhat.com> wrote:
> >
> > > On 03/18/2010 03:02 PM, Ingo Molnar wrote:
> > > >
> > > >> [...] What users eagerly replace their kernels?
> > > >
> > > > Those 99% who click on the 'install 193 updates' popup.
> > > >
> > >
> > > Of which 1 is the kernel, and 192 are userspace updates (of which one may be
> > > qemu).
> >
> > I think you didn't understand my (tersely explained) point - which is probably
> > my fault. What I said is:
> >
> > - distros update the kernel first. Often in stable releases as well if
> > there's a new kernel released. (They must because it provides new hardware
> > enablement and other critical changes they generally cannot skip.)
> >
> > - Qemu on the other hand is not upgraded with (nearly) that level of urgency.
> > Completely new versions will generally have to wait for the next distro
> > release.
>
> This has nothing to do with them being in separate source repos. We could
> update QEMU to new major feature releases with the same frequency in a Fedora
> release, but we deliberately choose not to rebase the QEMU userspace because
> experience has shown that the downside from new bugs / regressions outweighs
> the benefit of any new features.
>
> The QEMU updates in stable Fedora trees now just follow the minor bugfix
> release stream provided by QEMU, and those arrive in Fedora with little
> noticeable delay.

That is exactly what I said: Qemu and most user-space packages are on a
'slower' update track than the kernel: generally updated for minor releases.

My further point was that the kernel on the other hand gets updated more
frequently and as such, any user-space tool bits hosted in the kernel repo get
updated more frequently as well.

Thanks,

Ingo

Ingo Molnar

unread,
Mar 18, 2010, 10:20:02 AM3/18/10
to

* Avi Kivity <a...@redhat.com> wrote:

> > That is not what I said. I said they are closely related, and where
> > technologies are closely related, project proximity turns into project
> > unification at a certain stage.
>
> I really don't see how. So what if both qemu and kvm implement an i8254?
> They can't share any code since the internal APIs are so different. [...]

I wouldn't jump to assumptions there. perf shares some facilities with the
kernel at the source code level - they can be built both in the kernel and in
user-space.

But my main thought wasn't even to actually share the implementation - it was
to synchronize when a piece of device emulation moves into the kernel. It is
arguably bad for performance in most cases when Qemu handles a given device -
so all the common devices should be kernel-accelerated.
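
(Illustrative aside: the qemu-kvm fork of this era already behaves that way
for the common devices - the irqchip and the i8254 PIT are emulated
in-kernel by default, with userspace emulation as a fallback. A sketch,
with the qemu-kvm-specific switches given as examples:

    qemu-kvm disk.img                             # in-kernel irqchip/PIT (default)
    qemu-kvm -no-kvm-irqchip -no-kvm-pit disk.img # force userspace emulation

Disabling them trades performance for running the emulation in Qemu.)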

The version and testing matrix would be simplified significantly as well: as
kernel and qemu go hand in hand, they are always on the same version.

> [...] Even worse for the x86 emulator as qemu and kvm are fundamentally
> different.

So is it your argument that the difference and the duplication in x86
instruction emulation are a good thing? You said some time ago that the kvm
x86 emulator was very messy and that you wished it were cleaner.

While qemu's is indeed rather different (it's partly a translator/JIT), I'm
sure the decoder logic could be shared - and qemu has a slow-path
full-emulation fallback in any case, which is similar to what the in-kernel
emulator does (IIRC ...).

That might have changed meanwhile.

Ingo

Ingo Molnar

unread,
Mar 18, 2010, 10:20:02 AM3/18/10
to

* John Kacur <jka...@redhat.com> wrote:

> On Thu, Mar 18, 2010 at 2:59 PM, Ingo Molnar <mi...@elte.hu> wrote:
> >
> > * Daniel P. Berrange <berr...@redhat.com> wrote:
> >
> >> On Thu, Mar 18, 2010 at 02:31:24PM +0100, Ingo Molnar wrote:
> >> >
> >> > * Avi Kivity <a...@redhat.com> wrote:
> >> >
> >> > > On 03/18/2010 03:02 PM, Ingo Molnar wrote:
> >> > > >
> >> > > >> [...] What users eagerly replace their kernels?
> >> > > >
> >> > > > Those 99% who click on the 'install 193 updates' popup.
> >> > > >
> >> > >
> >> > > Of which 1 is the kernel, and 192 are userspace updates (of which one may be
> >> > > qemu).
> >> >
> >> > I think you didn't understand my (tersely explained) point - which is probably
> >> > my fault. What I said is:
> >> >

> >> >  - distros update the kernel first. Often in stable releases as well if
> >> >    there's a new kernel released. (They must because it provides new hardware
> >> >    enablement and other critical changes they generally cannot skip.)
> >> >
> >> >  - Qemu on the other hand is not upgraded with (nearly) that level of urgency.
> >> >    Completely new versions will generally have to wait for the next distro
> >> >    release.


> >>
> >> This has nothing to do with them being in separate source repos. We could
> >> update QEMU to new major feature releases with the same frequency in a Fedora
> >> release, but we deliberately choose not to rebase the QEMU userspace because
> >> experience has shown that the downside from new bugs / regressions outweighs
> >> the benefit of any new features.
> >>
> >> The QEMU updates in stable Fedora trees now just follow the minor bugfix
> >> release stream provided by QEMU, and those arrive in Fedora with little
> >> noticeable delay.
> >
> > That is exactly what I said: Qemu and most user-space packages are on a
> > 'slower' update track than the kernel: generally updated for minor releases.
> >
> > My further point was that the kernel on the other hand gets updated more
> > frequently and as such, any user-space tool bits hosted in the kernel repo get
> > updated more frequently as well.
> >
> > Thanks,
> >

> >        Ingo


>
> Just to play devil's advocate, let's not mix up the development model with
> the distribution model. There is nothing to stop packagers and distributors
> from providing separate kernel "proper" packages and perf tools packages.
>
> It might even make good sense, assuming backwards compatibility, for distros
> that have conservative policies about new kernel versions to provide newer
> perf tools packages with older kernels.

Of course. Some distros are also very conservative about updating the kernel
at all.

I'm mostly talking about the distros at the frontier of kernel development:
those with fresh packages, which provide eager bleeding-edge testers and
developers.

Ingo
