
[RFC][PATCH 0/3] tracing: Add Hardware Latency detector tracer


Steven Rostedt

Aug 4, 2016, 11:07:38 AM
to linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior

This adds the PREEMPT_RT hwlat detector as a Linux tracer in mainline.
In the PREEMPT_RT patch set, it is a separate entity controlled by the
debugfs file system. I found that it is better suited as a
latency tracer in the tracing directory, as it follows pretty much the
same paradigm as the other latency tracers that already exist. All
I had to add was a hwlat_detector directory containing a window
and width for the period and duration, respectively, of the test. The
samples are written to the tracing ring buffer, the max latency is
stored in tracing_max_latency, and the threshold can be set by the
existing tracing_thresh. The last patch also adds a new feature that
has the kthread migrate after each period to another CPU specified by
tracing_cpumask.


Here's the documentation that is added to hwlat_detector.txt in the
Documentation directory:

Introduction:
-------------

The tracer hwlat_detector is a special purpose tracer that is used to
detect large system latencies induced by the behavior of certain underlying
hardware or firmware, independent of Linux itself. The code was originally
developed to detect SMIs (System Management Interrupts) on x86 systems,
however there is nothing x86-specific about this patchset. It was
originally written for use by the "RT" patch, since the Real Time
kernel is highly latency sensitive.

SMIs are not serviced by the Linux kernel, which means that it does not
even know that they are occurring. SMIs are instead set up by BIOS code
and are serviced by BIOS code, usually for "critical" events such as
management of thermal sensors and fans. Sometimes, though, SMIs are used for
other tasks and those tasks can spend an inordinate amount of time in the
handler (sometimes measured in milliseconds). Obviously this is a problem if
you are trying to keep event service latencies down in the microsecond range.

The hardware latency detector works by hogging one of the CPUs for configurable
amounts of time (with interrupts disabled), polling the CPU Time Stamp Counter
(TSC) for some period, then looking for gaps in the TSC data. Any gap indicates
a time when the polling was interrupted, and since interrupts are disabled,
the only thing that could do that would be an SMI or other hardware hiccup
(or an NMI, but those can be tracked).
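For illustration, the core of that sampling loop can be sketched like this
(a simplified sketch of get_sample() from trace_hwlat.c, not the verbatim
code; time_get() stands in for trace_clock_local(), and kernel context
providing div_u64() and NSEC_PER_USEC is assumed):

static u64 sample_once(u64 width_us)
{
	u64 start, t1, t2, total, max_gap = 0;

	start = time_get();			/* ns timestamp */
	do {
		t1 = time_get();
		t2 = time_get();
		/* any gap between two back-to-back reads is time
		 * stolen by an SMI, NMI, or other hardware hiccup */
		if (t2 - t1 > max_gap)
			max_gap = t2 - t1;
		total = t2 - start;
	} while (div_u64(total, NSEC_PER_USEC) <= width_us);

	return div_u64(max_gap, NSEC_PER_USEC);	/* worst gap (usecs) */
}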

Note that the hwlat detector should *NEVER* be used in a production environment.
It is intended to be run manually to determine if the hardware platform has a
problem with long system firmware service routines.

Usage:
------

Write the ASCII text "hwlat" into the current_tracer file of the tracing system
(mounted at /sys/kernel/tracing or /sys/kernel/debug/tracing). It is possible to
redefine the threshold in microseconds (us) above which latency spikes will
be taken into account.

Example:

# echo hwlat > /sys/kernel/tracing/current_tracer
# echo 100 > /sys/kernel/tracing/tracing_thresh

The /sys/kernel/tracing/hwlat_detector interface contains the following files:

width - time period to sample with CPUs held (usecs)
must be less than the total window size (enforced)
window - total period of sampling, width being inside (usecs)

By default the width is set to 500,000 and window to 1,000,000, meaning that
for every 1,000,000 usecs (1s) the hwlat detector will spin for 500,000 usecs
(0.5s). If tracing_thresh contains zero when the hwlat tracer is enabled, it
will change to a default of 10 usecs. If any latency that exceeds the
threshold is observed, the data will be written to the tracing ring buffer.

The minimum sleep time between periods is 1 millisecond, even if the gap
between width and window is smaller than that, so that the system is not
totally starved.
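To picture the period handling, here is a rough sketch of the sampling
kthread's main loop (simplified and paraphrased from trace_hwlat.c, not the
exact code):

static int kthread_fn(void *unused)
{
	u64 interval;

	while (!kthread_should_stop()) {
		local_irq_disable();
		get_sample();			/* spin for "width" usecs */
		local_irq_enable();

		/* sleep out the rest of the window, in milliseconds */
		interval = hwlat_data.sample_window - hwlat_data.sample_width;
		do_div(interval, USEC_PER_MSEC);
		if (interval < 1)		/* never fully starve the system */
			interval = 1;
		if (msleep_interruptible(interval))
			break;
	}
	return 0;
}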

If tracing_thresh was zero when the hwlat detector was started, it will be set
back to zero if another tracer is loaded. Note, the last value in
tracing_thresh that the hwlat detector had will be saved, and this value will
be restored in tracing_thresh if it is still zero when the hwlat detector is
started again.
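For illustration, the expected interaction looks like this (illustrative
values, not captured output):

# cd /sys/kernel/tracing
# cat tracing_thresh
0
# echo hwlat > current_tracer
# cat tracing_thresh
10
# echo nop > current_tracer
# cat tracing_thresh
0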

The following tracing directory files are used by the hwlat_detector:

in /sys/kernel/tracing:

tracing_thresh - minimum latency value to be considered (usecs)
tracing_max_latency - maximum hardware latency actually observed (usecs)
hwlat_detector/width - specified amount of time to spin within window (usecs)
hwlat_detector/window - amount of time between (width) runs (usecs)
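Putting it together, a complete session might look like this (an illustrative
sketch with made-up values, not captured output):

# cd /sys/kernel/tracing
# echo 0 > tracing_max_latency
# echo 250000 > hwlat_detector/width
# echo 1000000 > hwlat_detector/window
# echo hwlat > current_tracer
# sleep 60
# cat tracing_max_latency
# cat trace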

-- Steve


Jon Masters (1):
tracing: Add documentation for hwlat_detector tracer

Steven Rostedt (Red Hat) (2):
tracing: Added hardware latency tracer
tracing: Have hwlat trace migrate across tracing_cpumask CPUs

----
Documentation/trace/hwlat_detector.txt | 79 +++++
kernel/trace/Kconfig | 35 ++
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 2 +-
kernel/trace/trace.h | 3 +
kernel/trace/trace_entries.h | 23 ++
kernel/trace/trace_hwlat.c | 582 +++++++++++++++++++++++++++++++++
kernel/trace/trace_output.c | 52 +++
8 files changed, 776 insertions(+), 1 deletion(-)
create mode 100644 Documentation/trace/hwlat_detector.txt
create mode 100644 kernel/trace/trace_hwlat.c

Steven Rostedt

Aug 4, 2016, 11:07:54 AM
to linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior
0001-tracing-Added-hardware-latency-tracer.patch

Steven Rostedt

Aug 4, 2016, 11:30:46 AM
to linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior
Note, I'm currently working on adding code to detect NMIs to this as well,
and perhaps even tracing SMI counters, just to show what caused the latency,
since the latency itself isn't measured by the counters (that I know of).

-- Steve

Steven Rostedt

Aug 4, 2016, 12:56:35 PM
to linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior

As NMIs can also cause latency when interrupts are disabled, the hwlat
detector has no way to know if the latency it detects is from an NMI or an
SMI or some other hardware glitch.

As the ftrace_nmi_enter/exit() functions are no longer used (except for sh,
which isn't supported anymore), I converted those to "arch_ftrace_nmi_enter/exit"
and use ftrace_nmi_enter/exit() to check whether the hwlat detector is tracing
or not, and if so, to call into the hwlat utility.

Since the hwlat detector only has a single kthread that is spinning with
interrupts disabled, it marks what CPU it is on, and if the NMI callback
happens on that CPU, it records the time spent in that NMI. This is added to
the output that is generated by the hwlat detector as:

[1] inner:0 outer:8 ts:1470329211.586481492 nmi-total:0 nmi-count:0
[2] inner:0 outer:4 ts:1470329231.619726657 nmi-total:4 nmi-count:1
[3] inner:0 outer:5 ts:1470329234.550761052 nmi-total:4 nmi-count:1
[4] inner:8 outer:0 ts:1470329263.655911668 nmi-total:0 nmi-count:0

All time is still tracked in microseconds.

Is the nmi-count even required? I could also modify it to not show the
NMI information if there weren't any NMIs.

Signed-off-by: Steven Rostedt <ros...@goodmis.org>
---
arch/sh/kernel/ftrace.c | 4 ++--
include/linux/ftrace_irq.h | 31 +++++++++++++++++++++++++++----
kernel/trace/trace_entries.h | 8 ++++++--
kernel/trace/trace_hwlat.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_output.c | 6 ++++--
5 files changed, 83 insertions(+), 10 deletions(-)

diff --git a/arch/sh/kernel/ftrace.c b/arch/sh/kernel/ftrace.c
index 38993e09ef03..deb9eba3a43d 100644
--- a/arch/sh/kernel/ftrace.c
+++ b/arch/sh/kernel/ftrace.c
@@ -139,7 +139,7 @@ static void ftrace_mod_code(void)
clear_mod_flag();
}

-void ftrace_nmi_enter(void)
+void arch_ftrace_nmi_enter(void)
{
if (atomic_inc_return(&nmi_running) & MOD_CODE_WRITE_FLAG) {
smp_rmb();
@@ -150,7 +150,7 @@ void ftrace_nmi_enter(void)
smp_mb();
}

-void ftrace_nmi_exit(void)
+void arch_ftrace_nmi_exit(void)
{
/* Finish all executions before clearing nmi_running */
smp_mb();
diff --git a/include/linux/ftrace_irq.h b/include/linux/ftrace_irq.h
index dca7bf8cffe2..4ec2c9b205f2 100644
--- a/include/linux/ftrace_irq.h
+++ b/include/linux/ftrace_irq.h
@@ -3,11 +3,34 @@


#ifdef CONFIG_FTRACE_NMI_ENTER
-extern void ftrace_nmi_enter(void);
-extern void ftrace_nmi_exit(void);
+extern void arch_ftrace_nmi_enter(void);
+extern void arch_ftrace_nmi_exit(void);
#else
-static inline void ftrace_nmi_enter(void) { }
-static inline void ftrace_nmi_exit(void) { }
+static inline void arch_ftrace_nmi_enter(void) { }
+static inline void arch_ftrace_nmi_exit(void) { }
#endif

+#ifdef CONFIG_HWLAT_TRACER
+extern bool trace_hwlat_callback_enabled;
+extern void trace_hwlat_callback(bool enter);
+#endif
+
+static inline void ftrace_nmi_enter(void)
+{
+#ifdef CONFIG_HWLAT_TRACER
+ if (trace_hwlat_callback_enabled)
+ trace_hwlat_callback(true);
+#endif
+ arch_ftrace_nmi_enter();
+}
+
+static inline void ftrace_nmi_exit(void)
+{
+ arch_ftrace_nmi_exit();
+#ifdef CONFIG_HWLAT_TRACER
+ if (trace_hwlat_callback_enabled)
+ trace_hwlat_callback(false);
+#endif
+}
+
#endif /* _LINUX_FTRACE_IRQ_H */
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 70d47dd99359..d1cc37e78f99 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -330,18 +330,22 @@ FTRACE_ENTRY(hwlat, hwlat_entry,
F_STRUCT(
__field( u64, duration )
__field( u64, outer_duration )
+ __field( u64, nmi_total_ts )
__field_struct( struct timespec, timestamp )
__field_desc( long, timestamp, tv_sec )
__field_desc( long, timestamp, tv_nsec )
+ __field( unsigned int, nmi_count )
__field( unsigned int, seqnum )
),

- F_printk("cnt:%u\tts:%010lu.%010lu\tinner:%llu\touter:%llu\n",
+ F_printk("cnt:%u\tts:%010lu.%010lu\tinner:%llu\touter:%llunmi-ts:%llu\tnmi-count:%u\n",
__entry->seqnum,
__entry->tv_sec,
__entry->tv_nsec,
__entry->duration,
- __entry->outer_duration),
+ __entry->outer_duration,
+ __entry->nmi_total_ts,
+ __entry->nmi_count),

FILTER_OTHER
);
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index f0b5b2944ee7..0c7ca6082721 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -64,6 +64,15 @@ static struct dentry *hwlat_sample_window; /* sample window us */
/* Save the previous tracing_thresh value */
static unsigned long save_tracing_thresh;

+/* NMI timestamp counters */
+static u64 nmi_ts_start;
+static u64 nmi_total_ts;
+static int nmi_count;
+static int nmi_cpu;
+
+/* Tells NMIs to call back to the hwlat tracer to record timestamps */
+bool trace_hwlat_callback_enabled;
+
/* If the user changed threshold, remember it */
static u64 last_tracing_thresh = DEFAULT_LAT_THRESHOLD * NSEC_PER_USEC;

@@ -72,7 +81,9 @@ struct hwlat_sample {
u64 seqnum; /* unique sequence */
u64 duration; /* delta */
u64 outer_duration; /* delta (outer loop) */
+ u64 nmi_total_ts; /* Total time spent in NMIs */
struct timespec timestamp; /* wall time */
+ int nmi_count; /* # NMIs during this sample */
};

/* keep the global state somewhere. */
@@ -112,6 +123,8 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
entry->duration = sample->duration;
entry->outer_duration = sample->outer_duration;
entry->timestamp = sample->timestamp;
+ entry->nmi_total_ts = sample->nmi_total_ts;
+ entry->nmi_count = sample->nmi_count;

if (!call_filter_check_discard(call, entry, buffer, event))
__buffer_unlock_commit(buffer, event);
@@ -125,6 +138,19 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
#define init_time(a, b) (a = b)
#define time_u64(a) a

+void trace_hwlat_callback(bool enter)
+{
+ if (smp_processor_id() != nmi_cpu)
+ return;
+
+ if (enter)
+ nmi_ts_start = time_get();
+ else {
+ nmi_total_ts = time_get() - nmi_ts_start;
+ nmi_count++;
+ }
+}
+
/**
* get_sample - sample the CPU TSC and look for likely hardware latencies
*
@@ -144,6 +170,14 @@ static int get_sample(void)

do_div(thresh, NSEC_PER_USEC); /* modifies interval value */

+ nmi_cpu = smp_processor_id();
+ nmi_total_ts = 0;
+ nmi_count = 0;
+ /* Make sure NMIs see this first */
+ barrier();
+
+ trace_hwlat_callback_enabled = true;
+
init_time(last_t2, 0);
start = time_get(); /* start timestamp */

@@ -188,6 +222,10 @@ static int get_sample(void)

} while (total <= hwlat_data.sample_width);

+ barrier(); /* finish the above in the view for NMIs */
+ trace_hwlat_callback_enabled = false;
+ barrier(); /* Make sure nmi_total_ts is no longer updated */
+
ret = 0;

/* If we exceed the threshold value, we have found a hardware latency */
@@ -196,11 +234,17 @@ static int get_sample(void)

ret = 1;

+ /* We read in microseconds */
+ if (nmi_total_ts)
+ do_div(nmi_total_ts, NSEC_PER_USEC);
+
hwlat_data.count++;
s.seqnum = hwlat_data.count;
s.duration = sample;
s.outer_duration = outer_sample;
s.timestamp = CURRENT_TIME;
+ s.nmi_total_ts = nmi_total_ts;
+ s.nmi_count = nmi_count;
trace_hwlat_sample(&s);

/* Keep a running maximum ever recorded hardware latency */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index cc5db62a3929..892359b75b61 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1109,12 +1109,14 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,

trace_assign_type(field, entry);

- trace_seq_printf(s, "[%u] inner:%llu outer:%lld ts:%ld.%09ld\n",
+ trace_seq_printf(s, "[%u] inner:%llu outer:%lld ts:%ld.%09ld nmi-total:%llu nmi-count:%u\n",
field->seqnum,
field->duration,
field->outer_duration,
field->timestamp.tv_sec,
- field->timestamp.tv_nsec);
+ field->timestamp.tv_nsec,
+ field->nmi_total_ts,
+ field->nmi_count);

return trace_handle_return(s);
}
--
1.9.3

Steven Rostedt

Aug 4, 2016, 1:16:56 PM
to linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior
On Thu, 4 Aug 2016 12:56:18 -0400
Steven Rostedt <ros...@goodmis.org> wrote:

> Is the nmi-count even required? I could also modify it to not show the
> NMI information if there wasn't any NMIs.

Yeah, only showing NMI info when nmi_count is > 0 is much nicer.
Updated patch:

---
As NMIs can also cause latency when interrupts are disabled, the hwlat
detector has no way to know if the latency it detects is from an NMI or an
SMI or some other hardware glitch.

As the ftrace_nmi_enter/exit() functions are no longer used (except for sh,
which isn't supported anymore), I converted those to "arch_ftrace_nmi_enter/exit"
and use ftrace_nmi_enter/exit() to check whether the hwlat detector is tracing
or not, and if so, to call into the hwlat utility.

Since the hwlat detector only has a single kthread that is spinning with
interrupts disabled, it marks what CPU it is on, and if the NMI callback
happens on that CPU, it records the time spent in that NMI. This is added to
the output that is generated by the hwlat detector as:

[1] inner:0 outer:8 ts:1470329211.586481492
[2] inner:0 outer:4 ts:1470329231.619726657 nmi-total:4 nmi-count:1
[3] inner:0 outer:5 ts:1470329234.550761052 nmi-total:4 nmi-count:1
[4] inner:8 outer:0 ts:1470329263.655911668

All time is still tracked in microseconds.

The NMI information is only shown when an NMI occurred during the sample.

Signed-off-by: Steven Rostedt <ros...@goodmis.org>
---
arch/sh/kernel/ftrace.c | 4 ++--
include/linux/ftrace_irq.h | 31 +++++++++++++++++++++++++++----
kernel/trace/trace_entries.h | 8 ++++++--
kernel/trace/trace_hwlat.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_output.c | 8 +++++++-
5 files changed, 86 insertions(+), 9 deletions(-)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index cc5db62a3929..c737d4d304f4 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1109,12 +1109,18 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,

trace_assign_type(field, entry);

- trace_seq_printf(s, "[%u] inner:%llu outer:%lld ts:%ld.%09ld\n",
+ trace_seq_printf(s, "[%u] inner:%llu outer:%lld ts:%ld.%09ld",
field->seqnum,
field->duration,
field->outer_duration,
field->timestamp.tv_sec,
field->timestamp.tv_nsec);
+ if (field->nmi_count)
+ trace_seq_printf(s, " nmi-total:%llu nmi-count:%u",
+ field->nmi_total_ts,
+ field->nmi_count);
+
+ trace_seq_putc(s, '\n');

return trace_handle_return(s);
}
--
1.9.3

Sebastian Andrzej Siewior

Aug 5, 2016, 10:25:34 AM
to Steven Rostedt, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde
* Steven Rostedt | 2016-08-04 10:57:09 [-0400]:

>diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
>new file mode 100644
>index 000000000000..08dfabe4e862
>--- /dev/null
>+++ b/kernel/trace/trace_hwlat.c
…
>+/* Macros to encapsulate the time capturing infrastructure */
>+#define time_type u64
>+#define time_get() trace_clock_local()
>+#define time_to_us(x) div_u64(x, 1000)
>+#define time_sub(a, b) ((a) - (b))
>+#define init_time(a, b) (a = b)
>+#define time_u64(a) a

Do we need a macro for this? In the old code we could choose depending on
CONFIG_TRACING, but now we don't.

Sebastian

Sebastian Andrzej Siewior

Aug 5, 2016, 10:36:07 AM
to Steven Rostedt, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde
* Steven Rostedt | 2016-08-04 13:16:45 [-0400]:

>diff --git a/include/linux/ftrace_irq.h b/include/linux/ftrace_irq.h
>index dca7bf8cffe2..4ec2c9b205f2 100644
>--- a/include/linux/ftrace_irq.h
>+++ b/include/linux/ftrace_irq.h
>@@ -3,11 +3,34 @@
…
>+static inline void ftrace_nmi_enter(void)
>+{
>+#ifdef CONFIG_HWLAT_TRACER
>+ if (trace_hwlat_callback_enabled)
>+ trace_hwlat_callback(true);

so we take a tracepoint while we enter an nmi

>--- a/kernel/trace/trace_hwlat.c
>+++ b/kernel/trace/trace_hwlat.c
>@@ -64,6 +64,15 @@ static struct dentry *hwlat_sample_window; /* sample window us */
> /* Save the previous tracing_thresh value */
> static unsigned long save_tracing_thresh;
>
>+/* NMI timestamp counters */
>+static u64 nmi_ts_start;
>+static u64 nmi_total_ts;
>+static int nmi_count;
>+static int nmi_cpu;

and this is always limited to one CPU at a time?

…
>@@ -125,6 +138,19 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
> #define init_time(a, b) (a = b)
> #define time_u64(a) a
>
>+void trace_hwlat_callback(bool enter)
>+{
>+ if (smp_processor_id() != nmi_cpu)
>+ return;
>+
>+ if (enter)
>+ nmi_ts_start = time_get();

but more interestingly: trace_clock_local() -> sched_clock()
and of kernel/time/sched_clock.c we do raw_read_seqcount(&cd.seq) which
means we are busted if the NMI triggers during update_clock_read_data().

>+ else {
>+ nmi_total_ts = time_get() - nmi_ts_start;
>+ nmi_count++;
>+ }
>+}

Sebastian

Steven Rostedt

Aug 5, 2016, 10:44:30 AM
to Sebastian Andrzej Siewior, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde
Probably not, I kept it for two reasons. 1) to keep the same logic as
what was in PREEMPT_RT, and 2) in case we can come up with a better
clock.

But it's not that important. Should it be nuked? They do somewhat make
the code easier to read.

-- Steve

Steven Rostedt

Aug 5, 2016, 10:52:18 AM
to Sebastian Andrzej Siewior, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde
On Fri, 5 Aug 2016 16:35:55 +0200
Sebastian Andrzej Siewior <big...@linutronix.de> wrote:

> * Steven Rostedt | 2016-08-04 13:16:45 [-0400]:
>
> >diff --git a/include/linux/ftrace_irq.h b/include/linux/ftrace_irq.h
> >index dca7bf8cffe2..4ec2c9b205f2 100644
> >--- a/include/linux/ftrace_irq.h
> >+++ b/include/linux/ftrace_irq.h
> >@@ -3,11 +3,34 @@
> …
> >+static inline void ftrace_nmi_enter(void)
> >+{
> >+#ifdef CONFIG_HWLAT_TRACER
> >+ if (trace_hwlat_callback_enabled)
> >+ trace_hwlat_callback(true);
>
> so we take a tracepoint while we enter an nmi

It's not technically a tracepoint. I'm not sure tracepoints
(jump labels) may be located this early in the NMI handler. This is
before some of the magic of having NMIs dealing with page faults and
breakpoints.

>
> >--- a/kernel/trace/trace_hwlat.c
> >+++ b/kernel/trace/trace_hwlat.c
> >@@ -64,6 +64,15 @@ static struct dentry *hwlat_sample_window; /* sample window us */
> > /* Save the previous tracing_thresh value */
> > static unsigned long save_tracing_thresh;
> >
> >+/* NMI timestamp counters */
> >+static u64 nmi_ts_start;
> >+static u64 nmi_total_ts;
> >+static int nmi_count;
> >+static int nmi_cpu;
>
> and this is always limited to one CPU at a time?

Yes. Hence the "nmi_cpu".

>
> …
> >@@ -125,6 +138,19 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
> > #define init_time(a, b) (a = b)
> > #define time_u64(a) a
> >
> >+void trace_hwlat_callback(bool enter)
> >+{
> >+ if (smp_processor_id() != nmi_cpu)
> >+ return;
> >+
> >+ if (enter)
> >+ nmi_ts_start = time_get();
>
> but more interestingly: trace_clock_local() -> sched_clock()
> and of kernel/time/sched_clock.c we do raw_read_seqcount(&cd.seq) which
> means we are busted if the NMI triggers during update_clock_read_data().

Hmm, interesting. Because this is true for general tracing from an NMI.

/me looks at code.

Ah, this is when we have GENERIC_SCHED_CLOCK, which would break tracing
if any arch that has this also has NMIs. Probably need to look at arm64.

For x86, it has its own NMI safe sched_clock. I could make this "NMI"
code depend on:

#ifndef CONFIG_GENERIC_SCHED_CLOCK


-- Steve

Sebastian Andrzej Siewior

Aug 5, 2016, 11:28:56 AM
to Steven Rostedt, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde
On 08/05/2016 04:44 PM, Steven Rostedt wrote:
> On Fri, 5 Aug 2016 16:25:21 +0200
> Sebastian Andrzej Siewior <big...@linutronix.de> wrote:
>
>> * Steven Rostedt | 2016-08-04 10:57:09 [-0400]:
>>
>>> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
>>> new file mode 100644
>>> index 000000000000..08dfabe4e862
>>> --- /dev/null
>>> +++ b/kernel/trace/trace_hwlat.c
>> …
>>> +/* Macros to encapsulate the time capturing infrastructure */
>>> +#define time_type u64
>>> +#define time_get() trace_clock_local()
>>> +#define time_to_us(x) div_u64(x, 1000)
>>> +#define time_sub(a, b) ((a) - (b))
>>> +#define init_time(a, b) (a = b)
>>> +#define time_u64(a) a
>>
>> Do we need a macro for this? In the old code we could choose between
>> CONFIG_TRACING but now we don't.
>>
>
> Probably not, I kept it for two reasons. 1) to keep the same logic as
> what was in PREEMPT_RT, and 2) in case we can come up with a better
> clock.

I assumed it was a leftover.

> But it's not that important. Should it be nuked? They do somewhat make
> the code easier to read.

That time_get() is close to ktime_get(), which is almost u64 nowadays,
so it might not be that cool for upstream. A hwlat prefix wouldn't make
the whole thing any prettier.

1. PREEMPT_RT. Do I need any changes? I assumed I could keep this 1:1
(once it is merged) and throw the current hwlat out.
2. a better clock is an argument. But why would you have a better clock
for hwlat and not for the whole tracing infrastructure?

If you want to keep it, keep it. I just assumed it was a leftover.
>
> -- Steve

Sebastian

Steven Rostedt

Aug 5, 2016, 11:36:54 AM
to Sebastian Andrzej Siewior, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde
On Fri, 5 Aug 2016 17:28:33 +0200
Sebastian Andrzej Siewior <big...@linutronix.de> wrote:


> If you want to keep it, keep it. I just assumed it was a leftover.

I have mixed feelings about it. I may just nuke it anyway. I forgot to
add #3 laziness, it worked and I didn't want to add bugs by removing it.

--Steve

Sebastian Andrzej Siewior

Aug 5, 2016, 11:41:00 AM
to Steven Rostedt, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde
On 08/05/2016 04:52 PM, Steven Rostedt wrote:
>>> --- a/kernel/trace/trace_hwlat.c
>>> +++ b/kernel/trace/trace_hwlat.c
>>> @@ -64,6 +64,15 @@ static struct dentry *hwlat_sample_window; /* sample window us */
>>> /* Save the previous tracing_thresh value */
>>> static unsigned long save_tracing_thresh;
>>>
>>> +/* NMI timestamp counters */
>>> +static u64 nmi_ts_start;
>>> +static u64 nmi_total_ts;
>>> +static int nmi_count;
>>> +static int nmi_cpu;
>>
>> and this is always limited to one CPU at a time?
>
> Yes. Hence the "nmi_cpu".

I was just confused. So we check one CPU at a time. Okay.

>>> @@ -125,6 +138,19 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
>>> #define init_time(a, b) (a = b)
>>> #define time_u64(a) a
>>>
>>> +void trace_hwlat_callback(bool enter)
>>> +{
>>> + if (smp_processor_id() != nmi_cpu)
>>> + return;
>>> +
>>> + if (enter)
>>> + nmi_ts_start = time_get();
>>
>> but more interestingly: trace_clock_local() -> sched_clock()
>> and of kernel/time/sched_clock.c we do raw_read_seqcount(&cd.seq) which
>> means we are busted if the NMI triggers during update_clock_read_data().
>
> Hmm, interesting. Because this is true for general tracing from an NMI.
>
> /me looks at code.
>
> Ah, this is when we have GENERIC_SCHED_CLOCK, which would break tracing
> if any arch that has this also has NMIs. Probably need to look at arm64.

arm64 should use the generic code as they don't provide sched_clock()
(and I doubt they go for the weak jiffy version).

> For x86, it has its own NMI safe sched_clock. I could make this "NMI"
> code depend on:
>
> #ifndef CONFIG_GENERIC_SCHED_CLOCK

That would be nice. That would disable approx $(git grep
sched_clock_register | wc -l) users, but better than a lockup I guess.

>
> -- Steve

Sebastian

Steven Rostedt

Aug 5, 2016, 12:17:23 PM
to Sebastian Andrzej Siewior, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde
On Fri, 5 Aug 2016 17:40:43 +0200
Sebastian Andrzej Siewior <big...@linutronix.de> wrote:


> > Ah, this is when we have GENERIC_SCHED_CLOCK, which would break tracing
> > if any arch that has this also has NMIs. Probably need to look at arm64.
>
> arm64 should use the generic code as they don't provide sched_clock()
> (and I doubt they go for the weak jiffy version).

And I'm guessing that arm64 supports NMIs. Is there no NMI safe clock?
I probably need to look at the tracing code. Because you can have the
same issue while tracing in an NMI as well.

-- Steve

Steven Rostedt

Aug 9, 2016, 1:23:06 PM
to Sebastian Andrzej Siewior, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde
On Fri, 5 Aug 2016 17:40:43 +0200
Sebastian Andrzej Siewior <big...@linutronix.de> wrote:

> > For x86, it has its own NMI safe sched_clock. I could make this "NMI"
> > code depend on:
> >
> > #ifndef CONFIG_GENERIC_SCHED_CLOCK
>
> that would be nice. That would be disable approx $(git grep
> sched_clock_register | wc -l) users but better than a lock up I guess.

How about this patch. It still records nmi_count, just not the time
spent in the NMI.

-- Steve


As NMIs can also cause latency when interrupts are disabled, the hwlat
detector has no way to know if the latency it detects is from an NMI or an
SMI or some other hardware glitch.

As the ftrace_nmi_enter/exit() functions are no longer used (except for sh,
which isn't supported anymore), I converted those to "arch_ftrace_nmi_enter/exit"
and use ftrace_nmi_enter/exit() to check whether the hwlat detector is tracing
or not, and if so, to call into the hwlat utility.

Since the hwlat detector only has a single kthread that is spinning with
interrupts disabled, it marks what CPU it is on, and if the NMI callback
happens on that CPU, it records the time spent in that NMI. This is added to
the output that is generated by the hwlat detector as:

[1] inner:0 outer:8 ts:1470329211.586481492
[2] inner:0 outer:4 ts:1470329231.619726657 nmi-total:4 nmi-count:1
[3] inner:0 outer:5 ts:1470329234.550761052 nmi-total:4 nmi-count:1
[4] inner:8 outer:0 ts:1470329263.655911668

All time is still tracked in microseconds.

The NMI information is only shown when an NMI occurred during the sample.

Signed-off-by: Steven Rostedt <ros...@goodmis.org>
---
arch/sh/kernel/ftrace.c | 4 ++--
include/linux/ftrace_irq.h | 31 +++++++++++++++++++++++----
kernel/trace/trace_entries.h | 8 +++++--
kernel/trace/trace_hwlat.c | 51 ++++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_output.c | 16 +++++++++++++-
5 files changed, 101 insertions(+), 9 deletions(-)

diff --git a/arch/sh/kernel/ftrace.c b/arch/sh/kernel/ftrace.c
index 38993e09ef03..deb9eba3a43d 100644
--- a/arch/sh/kernel/ftrace.c
+++ b/arch/sh/kernel/ftrace.c
@@ -139,7 +139,7 @@ static void ftrace_mod_code(void)
clear_mod_flag();
}

-void ftrace_nmi_enter(void)
+void arch_ftrace_nmi_enter(void)
{
if (atomic_inc_return(&nmi_running) & MOD_CODE_WRITE_FLAG) {
smp_rmb();
@@ -150,7 +150,7 @@ void ftrace_nmi_enter(void)
smp_mb();
}

-void ftrace_nmi_exit(void)
+void arch_ftrace_nmi_exit(void)
{
/* Finish all executions before clearing nmi_running */
smp_mb();
diff --git a/include/linux/ftrace_irq.h b/include/linux/ftrace_irq.h
index dca7bf8cffe2..4ec2c9b205f2 100644
--- a/include/linux/ftrace_irq.h
+++ b/include/linux/ftrace_irq.h
@@ -3,11 +3,34 @@


#ifdef CONFIG_FTRACE_NMI_ENTER
-extern void ftrace_nmi_enter(void);
-extern void ftrace_nmi_exit(void);
+extern void arch_ftrace_nmi_enter(void);
+extern void arch_ftrace_nmi_exit(void);
#else
-static inline void ftrace_nmi_enter(void) { }
-static inline void ftrace_nmi_exit(void) { }
+static inline void arch_ftrace_nmi_enter(void) { }
+static inline void arch_ftrace_nmi_exit(void) { }
#endif

+#ifdef CONFIG_HWLAT_TRACER
+extern bool trace_hwlat_callback_enabled;
+extern void trace_hwlat_callback(bool enter);
+#endif
+
+static inline void ftrace_nmi_enter(void)
+{
+#ifdef CONFIG_HWLAT_TRACER
+ if (trace_hwlat_callback_enabled)
+ trace_hwlat_callback(true);
+#endif
+ arch_ftrace_nmi_enter();
+}
+
+static inline void ftrace_nmi_exit(void)
+{
+ arch_ftrace_nmi_exit();
+#ifdef CONFIG_HWLAT_TRACER
+ if (trace_hwlat_callback_enabled)
+ trace_hwlat_callback(false);
+#endif
+}
+
 #endif /* _LINUX_FTRACE_IRQ_H */
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index f0b5b2944ee7..2a668e55dcc6 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -64,6 +64,15 @@ static struct dentry *hwlat_sample_window; /* sample window us */
/* Save the previous tracing_thresh value */
static unsigned long save_tracing_thresh;

+/* NMI timestamp counters */
+static u64 nmi_ts_start;
+static u64 nmi_total_ts;
+static int nmi_count;
+static int nmi_cpu;
+
+/* Tells NMIs to call back to the hwlat tracer to record timestamps */
+bool trace_hwlat_callback_enabled;
+
/* If the user changed threshold, remember it */
static u64 last_tracing_thresh = DEFAULT_LAT_THRESHOLD * NSEC_PER_USEC;

@@ -72,7 +81,9 @@ struct hwlat_sample {
u64 seqnum; /* unique sequence */
u64 duration; /* delta */
u64 outer_duration; /* delta (outer loop) */
+ u64 nmi_total_ts; /* Total time spent in NMIs */
struct timespec timestamp; /* wall time */
+ int nmi_count; /* # NMIs during this sample */
};

/* keep the global state somewhere. */
@@ -112,6 +123,8 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
entry->duration = sample->duration;
entry->outer_duration = sample->outer_duration;
entry->timestamp = sample->timestamp;
+ entry->nmi_total_ts = sample->nmi_total_ts;
+ entry->nmi_count = sample->nmi_count;

if (!call_filter_check_discard(call, entry, buffer, event))
__buffer_unlock_commit(buffer, event);
@@ -125,6 +138,26 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
#define init_time(a, b) (a = b)
#define time_u64(a) a

+void trace_hwlat_callback(bool enter)
+{
+ if (smp_processor_id() != nmi_cpu)
+ return;
+
+ /*
+ * Currently trace_clock_local() calls sched_clock() and the
+ * generic version is not NMI safe.
+ */
+ if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
+ if (enter)
+ nmi_ts_start = time_get();
+ else
+ nmi_total_ts = time_get() - nmi_ts_start;
+ }
+
+ if (enter)
+ nmi_count++;
+}
+
/**
* get_sample - sample the CPU TSC and look for likely hardware latencies
*
@@ -144,6 +177,14 @@ static int get_sample(void)

do_div(thresh, NSEC_PER_USEC); /* modifies interval value */

+ nmi_cpu = smp_processor_id();
+ nmi_total_ts = 0;
+ nmi_count = 0;
+ /* Make sure NMIs see this first */
+ barrier();
+
+ trace_hwlat_callback_enabled = true;
+
init_time(last_t2, 0);
start = time_get(); /* start timestamp */

@@ -188,6 +229,10 @@ static int get_sample(void)

} while (total <= hwlat_data.sample_width);

+ barrier(); /* finish the above in the view for NMIs */
+ trace_hwlat_callback_enabled = false;
+ barrier(); /* Make sure nmi_total_ts is no longer updated */
+
ret = 0;

/* If we exceed the threshold value, we have found a hardware latency */
@@ -196,11 +241,17 @@ static int get_sample(void)

ret = 1;

+ /* We read in microseconds */
+ if (nmi_total_ts)
+ do_div(nmi_total_ts, NSEC_PER_USEC);
+
hwlat_data.count++;
s.seqnum = hwlat_data.count;
s.duration = sample;
s.outer_duration = outer_sample;
s.timestamp = CURRENT_TIME;
+ s.nmi_total_ts = nmi_total_ts;
+ s.nmi_count = nmi_count;
trace_hwlat_sample(&s);

/* Keep a running maximum ever recorded hardware latency */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index cc5db62a3929..5478a97e8db3 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1109,13 +1109,27 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,

trace_assign_type(field, entry);

- trace_seq_printf(s, "[%u] inner:%llu outer:%lld ts:%ld.%09ld\n",
+ trace_seq_printf(s, "[%u] inner:%llu outer:%lld ts:%ld.%09ld",
field->seqnum,
field->duration,
field->outer_duration,
field->timestamp.tv_sec,
field->timestamp.tv_nsec);

+ if (field->nmi_count) {
+ /*
+ * The generic sched_clock() is not NMI safe, thus
+ * we only record the count and not the time.
+ */
+ if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK))
+ trace_seq_printf(s, " nmi-total:%llu",
+ field->nmi_total_ts);
+ trace_seq_printf(s, " nmi-count:%u",
+ field->nmi_count);
+ }
+
+ trace_seq_putc(s, '\n');
+
return trace_handle_return(s);
}

--
1.9.3

Steven Rostedt

Aug 9, 2016, 2:05:56 PM
to linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra

If an arch supports counting of SMIs (like newer intel chips do), then it
can implement arch_smi_count() to return the number of SMIs that were
triggered. The hwlat detector will call this function to get the current
number of SMIs, and then after a period, it will read that function again,
and if there's a difference, it will record that into the sample.

For example:

[99] inner:13 outer:16 ts:1470352534.886878855
[100] inner:14 outer:18747 ts:1470352538.917966818 smi-count:2
[101] inner:0 outer:19162 ts:1470352539.920988709 smi-count:6
[102] inner:19376 outer:19276 ts:1470352540.923010578 smi-count:6
[103] inner:19650 outer:20665 ts:1470352541.926032469 smi-count:6
[104] inner:20526 outer:20680 ts:1470352542.973055312 smi-count:6
[105] inner:17 outer:17 ts:1470352543.990077507

Signed-off-by: Steven Rostedt <ros...@goodmis.org>
---
arch/x86/events/msr.c | 12 ++++++++++++
kernel/trace/trace_entries.h | 1 +
kernel/trace/trace_hwlat.c | 11 +++++++++++
kernel/trace/trace_output.c | 4 ++++
4 files changed, 28 insertions(+)

diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c
index 85ef3c2e80e0..ff0c6e6351b0 100644
--- a/arch/x86/events/msr.c
+++ b/arch/x86/events/msr.c
@@ -27,6 +27,18 @@ static bool test_irperf(int idx)
return boot_cpu_has(X86_FEATURE_IRPERF);
}

+int arch_smi_count(void)
+{
+ unsigned long long count;
+ int err;
+
+ err = rdmsrl_safe(MSR_SMI_COUNT, &count);
+ if (err)
+ return 0;
+
+ return count;
+}
+
static bool test_intel(int idx)
{
if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index d1cc37e78f99..207faa837d3d 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -335,6 +335,7 @@ FTRACE_ENTRY(hwlat, hwlat_entry,
__field_desc( long, timestamp, tv_sec )
__field_desc( long, timestamp, tv_nsec )
__field( unsigned int, nmi_count )
+ __field( unsigned int, smi_count )
__field( unsigned int, seqnum )
),

diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 2a668e55dcc6..1d60ef5c404f 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -69,6 +69,7 @@ static u64 nmi_ts_start;
static u64 nmi_total_ts;
static int nmi_count;
static int nmi_cpu;
+static int smi_count;

/* Tells NMIs to call back to the hwlat tracer to record timestamps */
bool trace_hwlat_callback_enabled;
@@ -84,6 +85,7 @@ struct hwlat_sample {
u64 nmi_total_ts; /* Total time spent in NMIs */
struct timespec timestamp; /* wall time */
int nmi_count; /* # NMIs during this sample */
+ int smi_count; /* # SMIs during sampling (if arch supported) */
};

/* keep the global state somewhere. */
@@ -125,6 +127,7 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
entry->timestamp = sample->timestamp;
entry->nmi_total_ts = sample->nmi_total_ts;
entry->nmi_count = sample->nmi_count;
+ entry->smi_count = sample->smi_count;

if (!call_filter_check_discard(call, entry, buffer, event))
__buffer_unlock_commit(buffer, event);
@@ -138,6 +141,11 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)
#define init_time(a, b) (a = b)
#define time_u64(a) a

+__weak int arch_smi_count(void)
+{
+ return 0;
+}
+
void trace_hwlat_callback(bool enter)
{
if (smp_processor_id() != nmi_cpu)
@@ -180,6 +188,7 @@ static int get_sample(void)
nmi_cpu = smp_processor_id();
nmi_total_ts = 0;
nmi_count = 0;
+ smi_count = arch_smi_count();
/* Make sure NMIs see this first */
barrier();

@@ -231,6 +240,7 @@ static int get_sample(void)

barrier(); /* finish the above in the view for NMIs */
trace_hwlat_callback_enabled = false;
+ smi_count = arch_smi_count() - smi_count;
barrier(); /* Make sure nmi_total_ts is no longer updated */

ret = 0;
@@ -252,6 +262,7 @@ static int get_sample(void)
s.timestamp = CURRENT_TIME;
s.nmi_total_ts = nmi_total_ts;
s.nmi_count = nmi_count;
+ s.smi_count = smi_count;
trace_hwlat_sample(&s);

/* Keep a running maximum ever recorded hardware latency */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 5478a97e8db3..498eb7363e05 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1128,6 +1128,10 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,
field->nmi_count);
}

+ if (field->smi_count)
+ trace_seq_printf(s, " smi-count:%u",
+ field->smi_count);

Daniel Bristot de Oliveira

Aug 9, 2016, 2:29:13 PM
to Steven Rostedt, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra
On 08/09/2016 03:05 PM, Steven Rostedt wrote:
> If an arch supports counting of SMIs (like newer intel chips do), then it
> can implement arch_smi_count() to return the number of SMIs that were
> triggered. The hwlat detector will call this function to get the current
> number of SMIs, and then after a period, it will read that function again,
> and if there's a difference, it will record that into the sample.
>
> For example:
>
> [99] inner:13 outer:16 ts:1470352534.886878855
> [100] inner:14 outer:18747 ts:1470352538.917966818 smi-count:2
> [101] inner:0 outer:19162 ts:1470352539.920988709 smi-count:6
> [102] inner:19376 outer:19276 ts:1470352540.923010578 smi-count:6
> [103] inner:19650 outer:20665 ts:1470352541.926032469 smi-count:6
> [104] inner:20526 outer:20680 ts:1470352542.973055312 smi-count:6
> [105] inner:17 outer:17 ts:1470352543.990077507
>
> Signed-off-by: Steven Rostedt <ros...@goodmis.org>

It worked fine on a system where I can manually cause SMIs (by turning
the keyboard's backlight on and off).

Tested-by: Daniel Bristot de Oliveira <bri...@redhat.com>

-- Daniel

Steven Rostedt

Aug 9, 2016, 2:30:49 PM
to Clark Williams, Carsten Emde, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Thomas Gleixner, Jon Masters, Daniel Wagner, Sebastian Andrzej Siewior
On Tue, 9 Aug 2016 13:15:16 -0500
Clark Williams <will...@redhat.com> wrote:
> I like the trace report, but for one small thing; I think it's worth
> either adding units to the reported latency values (us suffix) or
> putting in a header line that hwlatdetect values are reported in
> microseconds.

Yeah, adding "us" to the time stamps would be useful. I'll update.

-- Steve

Steven Rostedt

Aug 9, 2016, 2:35:46 PM
to Daniel Bristot de Oliveira, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra
Thanks! I should also add:

Suggested-by: Peter Zijlstra <pet...@infradead.org>

As Peter was the one that recommended adding the arch_smi_count()
function.

But I can't push this yet as this would require acks from the x86
maintainers.

But if there's no complaints about the rest of the patches, I could
work to get that ready.

-- Steve

Peter Zijlstra

Aug 9, 2016, 5:26:17 PM
to Steven Rostedt, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior
On Tue, Aug 09, 2016 at 02:05:43PM -0400, Steven Rostedt wrote:
> +int arch_smi_count(void)
> +{
> + unsigned long long count;
> + int err;
> +
> + err = rdmsrl_safe(MSR_SMI_COUNT, &count);
> + if (err)
> + return 0;

That's really yucky, relying on _safe() to detect availability.

Also, I just found AMD Fam15h has this counter through PMCs (event
0x2b).

Steven Rostedt

Aug 10, 2016, 2:51:42 PM
to linux-...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra
Changes since v1:

. Added checks for CONFIG_GENERIC_SCHED_CLOCK, and if that is set,
only nmi_counts will be recorded when an NMI is triggered, but not
the time in NMI, because the generic sched_clock is not NMI safe.
(suggested by Sebastian Andrzej Siewior)

. Changed the format a little to be a little nicer looking.

Was: [99] inner:13 outer:16 ts:1470352534.886878855
Now: #99 inner/outer(us): 13/16 ts:1470352534.886878855

. Included the NMI patch into the main patch set.

. Holding off SMI counting until I have an arch_smi_count() that isn't
so "really yucky".

Jon Masters (1):
tracing: Add documentation for hwlat_detector tracer

Steven Rostedt (Red Hat) (3):
tracing: Added hardware latency tracer
tracing: Have hwlat trace migrate across tracing_cpumask CPUs
tracing: Add NMI tracing in hwlat detector

----
Documentation/trace/hwlat_detector.txt | 79 ++++
arch/sh/kernel/ftrace.c | 4 +-
include/linux/ftrace_irq.h | 31 +-
kernel/trace/Kconfig | 35 ++
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 2 +-
kernel/trace/trace.h | 3 +
kernel/trace/trace_entries.h | 27 ++
kernel/trace/trace_hwlat.c | 633 +++++++++++++++++++++++++++++++++
kernel/trace/trace_output.c | 66 ++++
10 files changed, 874 insertions(+), 7 deletions(-)

Jon Masters

Aug 10, 2016, 2:57:54 PM
to Steven Rostedt, Sebastian Andrzej Siewior, linux-...@vger.kernel.org, linux-r...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Daniel Wagner, Carsten Emde
On 08/05/2016 10:52 AM, Steven Rostedt wrote:

> Ah, this is when we have GENERIC_SCHED_CLOCK, which would break tracing
> if any arch that has this also has NMIs. Probably need to look at arm64.

Today, we don't have NMIs on arm64 that are supported in Linux.

Jon.

--
Computer Architect | Sent from my Fedora powered laptop

Steven Rostedt

Aug 10, 2016, 4:34:53 PM
to linux-...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra
0003-tracing-Have-hwlat-trace-migrate-across-tracing_cpum.patch

Steven Rostedt

Aug 10, 2016, 4:34:55 PM
to linux-...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra
0004-tracing-Add-NMI-tracing-in-hwlat-detector.patch

Steven Rostedt

Aug 10, 2016, 5:17:59 PM
to linux-...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra
0002-tracing-Add-documentation-for-hwlat_detector-tracer.patch

Steven Rostedt

Aug 10, 2016, 5:18:24 PM
to linux-...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra
0001-tracing-Added-hardware-latency-tracer.patch

Daniel Bristot de Oliveira

Aug 12, 2016, 2:14:00 PM
to Steven Rostedt, linux-...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra
On 08/10/2016 10:53 AM, Steven Rostedt wrote:
> This adds the PREEMPT_RT hwlat detector as a Linux tracer in mainline.
> In the PREEMPT_RT patch set, it is a separate entity that is controlled
> by the debugfs file system. I found that it is better suited as a
> latency tracer in the tracing directory, as it follows pretty much the
> same paradigm as the other latency tracers that already exist. All
> I had to add was a hwlat_detector directory that contained a window
> and width for the period and duration respectively of the test. But
> the samples would just write to the tracing ring buffer and the max
> latency would be stored in tracing_max_latency, and the threshold can
> be set by the existing tracing_threshold. The last patch also adds a
> new feature that would have the kthread migrate after each period to
> another CPU specified by tracing_cpumask.

Hi!

I tested this patchset on a system where I can cause SMIs. The results
are consistent with the latency I see when I run cyclictest on this box
and cause SMIs on it. The tracer will be more accurate, as expected. So:

Tested-by: Daniel Bristot de Oliveira <bri...@redhat.com>

Regarding SMI count, when I added SMI count support to cyclictest I
based hardware support check on turbostat's implementation, at

tools/power/x86/turbostat/turbostat.c:probe_nhm_msrs()

It is for user-space, but I think it is a good starting point... Just
thinking aloud.
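For reference, a minimal user-space sketch of that style of check, reading
MSR_SMI_COUNT (0x34) through the msr driver much like turbostat does (needs
root and the msr module loaded; error handling trimmed):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_SMI_COUNT	0x34	/* Intel Nehalem and later */

int main(void)
{
	uint64_t count;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	/* the msr driver maps the pread offset to the MSR number */
	if (fd < 0 || pread(fd, &count, sizeof(count), MSR_SMI_COUNT) != sizeof(count))
		return 1;	/* no msr driver, or MSR not supported */
	printf("SMI count: %llu\n", (unsigned long long)count);
	close(fd);
	return 0;
}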

-- Daniel

Peter Zijlstra

Aug 19, 2016, 10:09:52 AM
to Steven Rostedt, linux-...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior
On Wed, Aug 10, 2016 at 09:53:56AM -0400, Steven Rostedt wrote:
> Changes since v1:
>
> . Added checks for CONFIG_GENERIC_SCHED_CLOCK, and if that is set,
> only nmi_counts will be recorded when an NMI is triggered, but not
> the time in NMI, because the generic sched_clock is not NMI safe.
> (suggested by Sebastian Andrzej Siewior)

So do any of the platforms using GENERIC_SCHED_CLOCK _have_ NMIs?

In any case, for those you could probably use ktime_get_mono_fast_ns().

Steven Rostedt

Aug 19, 2016, 10:53:37 AM
to Peter Zijlstra, linux-...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior
On Fri, 19 Aug 2016 16:09:28 +0200
Peter Zijlstra <pet...@infradead.org> wrote:

> On Wed, Aug 10, 2016 at 09:53:56AM -0400, Steven Rostedt wrote:
> > Changes since v1:
> >
> > . Added checks for CONFIG_GENERIC_SCHED_CLOCK, and if that is set,
> > only nmi_counts will be recorded when an NMI is triggered, but not
> > the time in NMI, because the generic sched_clock is not NMI safe.
> > (suggested by Sebastian Andrzej Siewior)
>
> So any of the platforms using GENERIC_SCHED_CLOCK _have_ NMIs ?

From what I gather, the answer is no, so I don't think this is an issue.

>
> In any case, for those you could probably use ktime_get_mono_fast_ns().

Is that safe to call from NMI? Looking into the code I see:

now = ktime_to_ns(tkr->base) + timekeeping_get_ns(tkr);

where timekeeping_get_ns() has:

timekeeping_get_delta()

which does:

read_seqcount_begin()

which has (eventually):

repeat:
ret = READ_ONCE(s->sequence);
if (unlikely(ret & 1)) {
cpu_relax();
goto repeat;
}
return ret;

Which I think could cause a deadlock from an NMI.

-- Steve

Peter Zijlstra

Aug 19, 2016, 11:06:14 AM
to Steven Rostedt, linux-...@vger.kernel.org, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, John Stultz
On Fri, Aug 19, 2016 at 10:26:20AM -0400, Steven Rostedt wrote:
> > In any case, for those you could probably use ktime_get_mono_fast_ns().
>
> Is that safe to call from NMI?

It should be, we were very careful to make it so. Also read the comment
with __ktime_get_fast_ns().

> Looking into the code I see:
>
> now = ktime_to_ns(tkr->base) + timekeeping_get_ns(tkr);
>
> where timekeeping_get_ns() has:
>
> timekeeping_get_delta()
>
> which does:
>
> read_seqcount_begin()
>
> which has (eventually):

That's a bug in CONFIG_DEBUG_TIMEKEEPING; if you look at the
!CONFIG_DEBUG_TIMEKEEPING version of timekeeping_get_delta() it does no such thing.

John, looks like 4ca22c2648f9 ("timekeeping: Add warnings when overflows
or underflows are observed") buggered things.


John Stultz

Aug 19, 2016, 12:36:50 PM
to Peter Zijlstra, Steven Rostedt, lkml, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior
Oof. Apologies. I missed that the fast methods called that helper as well.


Any objection to open-coding it for the fast method, like (sorry for
the whitespace damage here):

@@ -401,7 +401,10 @@ static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf)
 	do {
 		seq = raw_read_seqcount_latch(&tkf->seq);
 		tkr = tkf->base + (seq & 0x01);
-		now = ktime_to_ns(tkr->base) + timekeeping_get_ns(tkr);
+		now = ktime_to_ns(tkr->base);
+
+		now += clocksource_delta(tkr->read(tkr->clock),
+					 tkr->cycle_last, tkr->mask);
 	} while (read_seqcount_retry(&tkf->seq, seq));

 	return now;

If not I'll get such a patch sorted, tested and tagged for -stable.

thanks
-john

Nilay Vaish

Aug 22, 2016, 1:29:00 PM
to Steven Rostedt, Linux Kernel list, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra
On 10 August 2016 at 08:53, Steven Rostedt <ros...@goodmis.org> wrote:
> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
> new file mode 100644
> index 000000000000..08dfabe4e862
> --- /dev/null
> +++ b/kernel/trace/trace_hwlat.c
> @@ -0,0 +1,527 @@
> +/*
> + * trace_hwlatdetect.c - A simple Hardware Latency detector.
> + *
> + * Use this tracer to detect large system latencies induced by the behavior of
> + * certain underlying system hardware or firmware, independent of Linux itself.
> + * The code was developed originally to detect the presence of SMIs on Intel
> + * and AMD systems, although there is no dependency upon x86 herein.
> + *
> + * The classical example usage of this tracer is in detecting the presence of
> + * SMIs or System Management Interrupts on Intel and AMD systems. An SMI is a
> + * somewhat special form of hardware interrupt spawned from earlier CPU debug
> + * modes in which the (BIOS/EFI/etc.) firmware arranges for the South Bridge
> + * LPC (or other device) to generate a special interrupt under certain
> + * circumstances, for example, upon expiration of a special SMI timer device,
> + * due to certain external thermal readings, on certain I/O address accesses,
> + * and other situations. An SMI hits a special CPU pin, triggers a special
> + * SMI mode (complete with special memory map), and the OS is unaware.
> + *
> + * Although certain hardware-inducing latencies are necessary (for example,
> + * a modern system often requires an SMI handler for correct thermal control
> + * and remote management) they can wreak havoc upon any OS-level performance
> + * guarantees toward low-latency, especially when the OS is not even made
> + * aware of the presence of these interrupts. For this reason, we need a
> + * somewhat brute force mechanism to detect these interrupts. In this case,
> + * we do it by hogging all of the CPU(s) for configurable timer intervals,
> + * sampling the built-in CPU timer, looking for discontiguous readings.
> + *
> + * WARNING: This implementation necessarily introduces latencies. Therefore,
> + * you should NEVER use this tracer while running in a production
> + * environment requiring any kind of low-latency performance
> + * guarantee(s).
> + *
> + * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <j...@redhat.com>
> + * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <sros...@redhat.com>
> + *
> + * Includes useful feedback from Clark Williams <cl...@redhat.com>
> + *
> + * This file is licensed under the terms of the GNU General Public
> + * License version 2. This program is licensed "as is" without any
> + * warranty of any kind, whether express or implied.
> + */
> +#include <linux/kthread.h>
> +#include <linux/tracefs.h>
> +#include <linux/uaccess.h>
> +#include <linux/delay.h>
> +#include "trace.h"
> +
> +static struct trace_array *hwlat_trace;
> +
> +#define U64STR_SIZE 22 /* 20 digits max */
> +
> +#define BANNER "hwlat_detector: "
> +#define DEFAULT_SAMPLE_WINDOW 1000000 /* 1s */
> +#define DEFAULT_SAMPLE_WIDTH 500000 /* 0.5s */
> +#define DEFAULT_LAT_THRESHOLD 10 /* 10us */
> +
> +/* sampling thread*/
> +static struct task_struct *hwlat_kthread;
> +
> +static struct dentry *hwlat_sample_width; /* sample width us */
> +static struct dentry *hwlat_sample_window; /* sample window us */
> +
> +/* Save the previous tracing_thresh value */
> +static unsigned long save_tracing_thresh;
> +
> +/* If the user changed threshold, remember it */
> +static u64 last_tracing_thresh = DEFAULT_LAT_THRESHOLD * NSEC_PER_USEC;
> +
> +/* Individual latency samples are stored here when detected. */
> +struct hwlat_sample {
> + u64 seqnum; /* unique sequence */
> + u64 duration; /* delta */
> + u64 outer_duration; /* delta (outer loop) */
> + struct timespec timestamp; /* wall time */
> +};
> +
> +/* keep the global state somewhere. */
> +static struct hwlat_data {
> +
> + struct mutex lock; /* protect changes */
> +
> + u64 count; /* total since reset */
> +
> + u64 sample_window; /* total sampling window (on+off) */
> + u64 sample_width; /* active sampling portion of window */
> +
> +} hwlat_data = {
> + .sample_window = DEFAULT_SAMPLE_WINDOW,
> + .sample_width = DEFAULT_SAMPLE_WIDTH,
> +};
> +
> +static void trace_hwlat_sample(struct hwlat_sample *sample)
> +{
> + struct trace_array *tr = hwlat_trace;
> + struct trace_event_call *call = &event_hwlat;

Steven, where is this variable event_hwlat declared? To me it seems
that some macro is declaring it (most likely DEFINE_EVENT) but I was
not able to figure out the chain that ends up in the declaration.

--
Nilay

Steven Rostedt

Aug 22, 2016, 3:16:25 PM
to Nilay Vaish, Linux Kernel list, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior, Peter Zijlstra
On Mon, 22 Aug 2016 12:27:58 -0500
Nilay Vaish <nilay...@gmail.com> wrote:


> > +static void trace_hwlat_sample(struct hwlat_sample *sample)
> > +{
> > + struct trace_array *tr = hwlat_trace;
> > + struct trace_event_call *call = &event_hwlat;
>
> Steven, where is this variable event_hwlat declared? To me it seems
> that some macro is declaring it (most likely DEFINE_EVENT) but I was
> not able to figure out the chain that ends up in the declaration.

Yeah, it's created similarly to how trace_events are created.

It's created via kernel/trace/trace_export.c using
kernel/trace/trace_entries.h
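For illustration, the pattern boils down to this (a standalone, simplified
sketch of the X-macro technique; the real FTRACE_ENTRY takes more arguments
and generates much more than shown here):

/* the trace_entries.h side: describe each entry once */
#define MY_ENTRIES \
	MY_ENTRY(hwlat) \
	MY_ENTRY(print)

/* the trace_export.c side: define the macro, then expand the list */
struct trace_event_call { const char *name; };

#define MY_ENTRY(name) \
	struct trace_event_call event_##name = { .name = #name };
MY_ENTRIES
#undef MY_ENTRY

/* "event_hwlat" now exists, even though no file spells it out literally */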

-- Steve

Thomas Gleixner

Aug 23, 2016, 4:52:01 AM
to Peter Zijlstra, John Stultz, Steven Rostedt, lkml, Ingo Molnar, Andrew Morton, Clark Williams, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior
On Tue, 23 Aug 2016, Peter Zijlstra wrote:
> Works for me, Thomas?

No objections.

Peter Zijlstra

Aug 23, 2016, 5:49:24 AM
to John Stultz, Steven Rostedt, lkml, Ingo Molnar, Andrew Morton, Clark Williams, Thomas Gleixner, Jon Masters, Daniel Wagner, Carsten Emde, Sebastian Andrzej Siewior
Works for me, Thomas?