Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[PATCH] perf/sched: fix for getting task's execute time

2 views
Skip to first unread message

Xiao Guangrong

unread,
Dec 6, 2009, 6:00:02 AM12/6/09
to
In current code, we get task's execute time by reading
"/proc/<pid>/sched" file, it's wrong if the task is created
by pthread_create(), because every thread task has same pid.
So, the correct way is reading "/proc/<ppid>/task/<tid>/sched"
file.

This patch also remove redundant include files since <sys/types.h>
is included in "perf.h"

Signed-off-by: Xiao Guangrong <xiaogu...@cn.fujitsu.com>
---
tools/perf/builtin-sched.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 26b782f..b2e910e 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -13,7 +13,6 @@
#include "util/debug.h"
#include "util/data_map.h"

-#include <sys/types.h>
#include <sys/prctl.h>

#include <semaphore.h>
@@ -416,7 +415,7 @@ static u64 get_cpu_usage_nsec_parent(void)

static u64 get_cpu_usage_nsec_self(void)
{
- char filename [] = "/proc/1234567890/sched";
+ char filename [] = "/proc/1234567890/task/1234567890/sched";
unsigned long msecs, nsecs;
char *line = NULL;
u64 total = 0;
@@ -425,7 +424,9 @@ static u64 get_cpu_usage_nsec_self(void)
FILE *file;
int ret;

- sprintf(filename, "/proc/%d/sched", getpid());
+ sprintf(filename, "/proc/%d/task/%d/sched", getpid(),
+ (pid_t)syscall(SYS_gettid));
+
file = fopen(filename, "r");
BUG_ON(!file);

--
1.6.1.2


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Peter Zijlstra

unread,
Dec 6, 2009, 6:10:02 AM12/6/09
to
On Sun, 2009-12-06 at 12:05 +0100, Peter Zijlstra wrote:

> On Sun, 2009-12-06 at 18:57 +0800, Xiao Guangrong wrote:
> > In current code, we get task's execute time by reading
> > "/proc/<pid>/sched" file, it's wrong if the task is created
> > by pthread_create(), because every thread task has same pid.
> > So, the correct way is reading "/proc/<ppid>/task/<tid>/sched"
> > file.
> >
> > This patch also remove redundant include files since <sys/types.h>
> > is included in "perf.h"
>
> We really should not be using these proc files but instead make sure
> this information gets transferred through a tracepoint or similar.
>
> Reading these proc files is too prone to races.

We can probably get the runtime by grouping a task-clock swcounter with
an appropriate other event.

Peter Zijlstra

unread,
Dec 6, 2009, 6:10:02 AM12/6/09
to
On Sun, 2009-12-06 at 18:57 +0800, Xiao Guangrong wrote:
> In current code, we get task's execute time by reading
> "/proc/<pid>/sched" file, it's wrong if the task is created
> by pthread_create(), because every thread task has same pid.
> So, the correct way is reading "/proc/<ppid>/task/<tid>/sched"
> file.
>
> This patch also remove redundant include files since <sys/types.h>
> is included in "perf.h"

We really should not be using these proc files but instead make sure


this information gets transferred through a tracepoint or similar.

Reading these proc files is too prone to races.

Ingo Molnar

unread,
Dec 6, 2009, 7:00:02 AM12/6/09
to

* Peter Zijlstra <pet...@infradead.org> wrote:

> On Sun, 2009-12-06 at 12:05 +0100, Peter Zijlstra wrote:
> > On Sun, 2009-12-06 at 18:57 +0800, Xiao Guangrong wrote:
> > > In current code, we get task's execute time by reading
> > > "/proc/<pid>/sched" file, it's wrong if the task is created
> > > by pthread_create(), because every thread task has same pid.
> > > So, the correct way is reading "/proc/<ppid>/task/<tid>/sched"
> > > file.
> > >
> > > This patch also remove redundant include files since <sys/types.h>
> > > is included in "perf.h"
> >
> > We really should not be using these proc files but instead make sure
> > this information gets transferred through a tracepoint or similar.
> >
> > Reading these proc files is too prone to races.

yeah. Ideally we'd want all valuable information that is available via
/proc to be available via perf events as well. In the future it should
be possible to run perf even without /proc mounted for example.

Furthermore it's good for consistency and simplicity as well, plus it's
faster too to get the information from the perf syscall and mmap-ed
ringbuffer than to read things via /proc. No need for ASCII conversion,
fixed record formats, fast streaming and buffering, no read() overhead,
etc.

> We can probably get the runtime by grouping a task-clock swcounter
> with an appropriate other event.

Would be lovely.

Ingo

Xiao Guangrong

unread,
Dec 6, 2009, 12:20:02 PM12/6/09
to
Peter Zijlstra wrote:

>> We really should not be using these proc files but instead make sure
>> this information gets transferred through a tracepoint or similar.
>>
>> Reading these proc files is too prone to races.
>
> We can probably get the runtime by grouping a task-clock swcounter with
> an appropriate other event.
>

Hi Peter,

Thanks your suggestion.

Actually, we can call getrusage(RUSAGE_THREAD, ru) to get current task's
execute time, and I think this is a simpler way.

I'll send v2 patch later.

Thanks,
Xiao

Xiao Guangrong

unread,
Dec 7, 2009, 2:30:02 AM12/7/09
to
In current code, task's execute time is got by reading

'/proc/<pid>/sched' file, it's wrong if the task is created
by pthread_create(), because every thread task has same pid.

This way also has two demerits:

1: 'perf sched replay' can't work if the kernel not compile with
'CONFIG_SCHED_DEBUG' option
2: perf tool should depend on proc file system

So, this patch call getrusage() to get task's execution time instead
of reading /proc file

Reported-by: Török Edwin <edwin...@gmail.com>


Signed-off-by: Xiao Guangrong <xiaogu...@cn.fujitsu.com>
---

tools/perf/builtin-sched.c | 46 ++++++++++++++-----------------------------
1 files changed, 15 insertions(+), 31 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 19f43fa..bd57994 100644


--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -13,7 +13,6 @@
#include "util/debug.h"
#include "util/data_map.h"

-#include <sys/types.h>
#include <sys/prctl.h>

#include <semaphore.h>

@@ -399,49 +398,34 @@ process_sched_event(struct task_desc *this_task __used, struct sched_atom *atom)
}
}

+static u64 get_sum_time(struct rusage *ru)
+{
+ u64 sum;
+
+ sum = ru->ru_utime.tv_sec*1e9 + ru->ru_utime.tv_usec*1e3;
+ sum += ru->ru_stime.tv_sec*1e9 + ru->ru_stime.tv_usec*1e3;
+ return sum;
+}
+
static u64 get_cpu_usage_nsec_parent(void)
{
struct rusage ru;
- u64 sum;
int err;

err = getrusage(RUSAGE_SELF, &ru);
BUG_ON(err);

- sum = ru.ru_utime.tv_sec*1e9 + ru.ru_utime.tv_usec*1e3;
- sum += ru.ru_stime.tv_sec*1e9 + ru.ru_stime.tv_usec*1e3;
-
- return sum;
+ return get_sum_time(&ru);


}

static u64 get_cpu_usage_nsec_self(void)
{
- char filename [] = "/proc/1234567890/sched";

- unsigned long msecs, nsecs;
- char *line = NULL;
- u64 total = 0;
- size_t len = 0;
- ssize_t chars;
- FILE *file;
- int ret;
-


- sprintf(filename, "/proc/%d/sched", getpid());

- file = fopen(filename, "r");
- BUG_ON(!file);
-
- while ((chars = getline(&line, &len, file)) != -1) {
- ret = sscanf(line, "se.sum_exec_runtime : %ld.%06ld\n",
- &msecs, &nsecs);
- if (ret == 2) {
- total = msecs*1e6 + nsecs;
- break;
- }
- }
- if (line)
- free(line);
- fclose(file);
+ struct rusage ru;
+ int err;

- return total;
+ err = getrusage(RUSAGE_THREAD, &ru);
+ BUG_ON(err);
+ return get_sum_time(&ru);
}

static void *thread_func(void *ctx)
--
1.6.1.2

Ingo Molnar

unread,
Dec 7, 2009, 2:40:01 AM12/7/09
to

* Xiao Guangrong <xiaogu...@cn.fujitsu.com> wrote:

> In current code, task's execute time is got by reading
> '/proc/<pid>/sched' file, it's wrong if the task is created
> by pthread_create(), because every thread task has same pid.
>
> This way also has two demerits:
>
> 1: 'perf sched replay' can't work if the kernel not compile with
> 'CONFIG_SCHED_DEBUG' option
> 2: perf tool should depend on proc file system
>
> So, this patch call getrusage() to get task's execution time instead
> of reading /proc file

ok, that's better than /proc, but how about using
PERF_COUNT_SW_TASK_CLOCK instead of rusage, to recover the CPU time
used? It would be more precise, and it would use the perf API. (Some
helper functions in tools/perf/lib/ would be nice to make it as easy to
use as rusage (or even easier if possible).)

Ingo

Xiao Guangrong

unread,
Dec 9, 2009, 5:00:03 AM12/9/09
to
In current code, task's execute time is got by reading
'/proc/<pid>/sched' file, it's wrong if the task is created
by pthread_create(), because every thread task has same pid.

This way also has two demerits:

1: 'perf sched replay' can't work if the kernel not compile
with 'CONFIG_SCHED_DEBUG' option
2: perf tool should depend on proc file system

So, this patch use PERF_COUNT_SW_TASK_CLOCK to get task's


execution time instead of reading /proc file

Changelog v2 -> v3:
use PERF_COUNT_SW_TASK_CLOCK instead of rusage() as Ingo's suggestion

Reported-by: Török Edwin <edwin...@gmail.com>
Signed-off-by: Xiao Guangrong <xiaogu...@cn.fujitsu.com>
---

tools/perf/builtin-sched.c | 55 +++++++++++++++++++++----------------------
1 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 19f43fa..b12b23a 100644


--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -13,7 +13,6 @@
#include "util/debug.h"
#include "util/data_map.h"

-#include <sys/types.h>
#include <sys/prctl.h>

#include <semaphore.h>

@@ -414,34 +413,33 @@ static u64 get_cpu_usage_nsec_parent(void)
return sum;
}

-static u64 get_cpu_usage_nsec_self(void)
+static int self_open_counters(void)


{
- char filename [] = "/proc/1234567890/sched";
- unsigned long msecs, nsecs;
- char *line = NULL;
- u64 total = 0;
- size_t len = 0;
- ssize_t chars;
- FILE *file;
- int ret;

+ struct perf_event_attr attr;
+ int fd;



- sprintf(filename, "/proc/%d/sched", getpid());
- file = fopen(filename, "r");
- BUG_ON(!file);

+ memset(&attr, 0, sizeof(attr));



- while ((chars = getline(&line, &len, file)) != -1) {
- ret = sscanf(line, "se.sum_exec_runtime : %ld.%06ld\n",
- &msecs, &nsecs);
- if (ret == 2) {
- total = msecs*1e6 + nsecs;
- break;
- }
- }
- if (line)
- free(line);
- fclose(file);

+ attr.type = PERF_TYPE_SOFTWARE;
+ attr.config = PERF_COUNT_SW_TASK_CLOCK;

- return total;
+ fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+
+ if (fd < 0)
+ die("Error: sys_perf_event_open() syscall returned"
+ "with %d (%s)\n", fd, strerror(errno));
+ return fd;
+}
+
+static u64 get_cpu_usage_nsec_self(int fd)
+{
+ u64 runtime;
+ int ret;
+
+ ret = read(fd, &runtime, sizeof(runtime));
+ BUG_ON(ret != sizeof(runtime));
+
+ return runtime;
}

static void *thread_func(void *ctx)
@@ -450,9 +448,11 @@ static void *thread_func(void *ctx)
u64 cpu_usage_0, cpu_usage_1;
unsigned long i, ret;
char comm2[22];
+ int fd;

sprintf(comm2, ":%s", this_task->comm);
prctl(PR_SET_NAME, comm2);
+ fd = self_open_counters();

again:
ret = sem_post(&this_task->ready_for_work);
@@ -462,16 +462,15 @@ again:
ret = pthread_mutex_unlock(&start_work_mutex);
BUG_ON(ret);

- cpu_usage_0 = get_cpu_usage_nsec_self();
+ cpu_usage_0 = get_cpu_usage_nsec_self(fd);

for (i = 0; i < this_task->nr_events; i++) {
this_task->curr_event = i;
process_sched_event(this_task, this_task->atoms[i]);
}

- cpu_usage_1 = get_cpu_usage_nsec_self();
+ cpu_usage_1 = get_cpu_usage_nsec_self(fd);
this_task->cpu_usage = cpu_usage_1 - cpu_usage_0;
-
ret = sem_post(&this_task->work_done_sem);
BUG_ON(ret);

--
1.6.1.2

Xiao Guangrong

unread,
Dec 9, 2009, 5:00:05 AM12/9/09
to
In current code, task's execute time is got by reading
'/proc/<pid>/sched' file, it's wrong if the task is created
by pthread_create(), because every thread task has same pid.

This way also has two demerits:

1: 'perf sched replay' can't work if the kernel not compile
with 'CONFIG_SCHED_DEBUG' option
2: perf tool should depend on proc file system

So, this patch use PERF_COUNT_SW_TASK_CLOCK to get task's


execution time instead of reading /proc file

Changelog v2 -> v3:


use PERF_COUNT_SW_TASK_CLOCK instead of rusage() as Ingo's suggestion

Signed-off-by: Xiao Guangrong <xiaogu...@cn.fujitsu.com>
---
tools/perf/builtin-sched.c | 55 +++++++++++++++++++++----------------------
1 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 19f43fa..b12b23a 100644


--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -13,7 +13,6 @@
#include "util/debug.h"
#include "util/data_map.h"

-#include <sys/types.h>
#include <sys/prctl.h>

#include <semaphore.h>

@@ -414,34 +413,33 @@ static u64 get_cpu_usage_nsec_parent(void)
return sum;
}

-static u64 get_cpu_usage_nsec_self(void)
+static int self_open_counters(void)
{

- char filename [] = "/proc/1234567890/sched";
- unsigned long msecs, nsecs;
- char *line = NULL;
- u64 total = 0;
- size_t len = 0;
- ssize_t chars;
- FILE *file;
- int ret;

+ struct perf_event_attr attr;
+ int fd;

- sprintf(filename, "/proc/%d/sched", getpid());
- file = fopen(filename, "r");
- BUG_ON(!file);

+ memset(&attr, 0, sizeof(attr));



- while ((chars = getline(&line, &len, file)) != -1) {
- ret = sscanf(line, "se.sum_exec_runtime : %ld.%06ld\n",
- &msecs, &nsecs);
- if (ret == 2) {
- total = msecs*1e6 + nsecs;
- break;
- }
- }
- if (line)
- free(line);
- fclose(file);

--

Xiao Guangrong

unread,
Dec 9, 2009, 5:00:05 AM12/9/09
to
Sorry, miss Reported-by, please ignore this mail, i'll resend it

Xiao Guangrong wrote:
> In current code, task's execute time is got by reading
> '/proc/<pid>/sched' file, it's wrong if the task is created
> by pthread_create(), because every thread task has same pid.
>
> This way also has two demerits:
>
> 1: 'perf sched replay' can't work if the kernel not compile
> with 'CONFIG_SCHED_DEBUG' option
> 2: perf tool should depend on proc file system
>
> So, this patch use PERF_COUNT_SW_TASK_CLOCK to get task's
> execution time instead of reading /proc file
>
> Changelog v2 -> v3:
> use PERF_COUNT_SW_TASK_CLOCK instead of rusage() as Ingo's suggestion
>
> Signed-off-by: Xiao Guangrong <xiaogu...@cn.fujitsu.com>

Ingo Molnar

unread,
Dec 9, 2009, 5:00:06 AM12/9/09
to

* Xiao Guangrong <xiaogu...@cn.fujitsu.com> wrote:

Very nice - and the code even got shorter a tiny bit.

Applied, thanks!

Ingo

Ingo Molnar

unread,
Dec 9, 2009, 5:10:01 AM12/9/09
to

* Xiao Guangrong <xiaogu...@cn.fujitsu.com> wrote:

> Sorry, miss Reported-by, please ignore this mail, i'll resend it

I've added Edwin's Reported-by, no need to resend.

Ingo

tip-bot for Xiao Guangrong

unread,
Dec 9, 2009, 5:10:02 AM12/9/09
to
Commit-ID: c0c9e72150c07b4a6766cd24a6f685ec2dc9c343
Gitweb: http://git.kernel.org/tip/c0c9e72150c07b4a6766cd24a6f685ec2dc9c343
Author: Xiao Guangrong <xiaogu...@cn.fujitsu.com>
AuthorDate: Wed, 9 Dec 2009 17:51:30 +0800
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Wed, 9 Dec 2009 10:59:12 +0100

perf sched: Fix for getting task's execution time

In current code, task's execute time is got by reading
'/proc/<pid>/sched' file, it's wrong if the task is created
by pthread_create(), because every thread task has same pid.

This way also has two demerits:

1: 'perf sched replay' can't work if the kernel is not
compiled with the 'CONFIG_SCHED_DEBUG' option

2: perf tool should depend on proc file system

So, this patch uses PERF_COUNT_SW_TASK_CLOCK to get task's
execution time instead of reading /proc file.

Changelog v2 -> v3:
use PERF_COUNT_SW_TASK_CLOCK instead of rusage() as Ingo's
suggestion

Reported-by: Torok Edwin <edwin...@gmail.com>
Signed-off-by: Xiao Guangrong <xiaogu...@cn.fujitsu.com>
Cc: Xiao Guangrong <ericx...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Frederic Weisbecker <fwei...@gmail.com>
Cc: Paul Mackerras <pau...@samba.org>
LKML-Reference: <4B1F732...@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>

0 new messages