100% CPU load from tup

Slawomir Czarko

unread,

Jan 26, 2012, 5:02:49 AM1/26/12

to tup-users

I'm not sure anybody else has seen that but sometimes the build is
very slow and I can see that tup process is using about 100% CPU (that
would actually be 100% of one CPU core) while the compiler processes
use only about 5-10% each. So it looks like for some reason tup
becomes the bottleneck. I can observe this when the build is already
running the build tasks so all the parsing is done etc. I'd expect
that at this stage tup doesn't have much to do.

-Slawomir

Mike Shal

unread,

Jan 27, 2012, 3:17:57 PM1/27/12

to tup-...@googlegroups.com

Is your disk by chance encrypted? I wonder if the cpu load to
encrypt/decrypt blocks is getting attributed to the tup process. The
two things going on while running sub-processes are:
1) The fuse thread processing fs requests. Ultimately all reads/writes
have to go through here (which is what accesses the underlying disk),
so maybe tup is taking the blame for that. I don't think there is much
else here that would be CPU-bound.
2) For finished jobs, tup does the post-processing on the files that
were read/written. Some of this is currently inefficient (like
O(#files^2)... I have a patch in the works but still needs to be
ported to windows :-/), so if a single sub-process accesses a lot of
files that could be an additional bottleneck. The file access data
must then be compared to the existing db contents to see what new
links need to be written / removed, etc.

Ultimately we will probably need to profile it to find out where the
real slowdown is. If it's in 1), then perhaps going back to a
multi-fuse (as mentioned with the ulimit issue), or at least to a
multi-threaded fuse loop will help. If it's in 2), we would need to
replace any dumb existing algorithms with less-dumb ones.

-Mike

Slawomir Czarko

unread,

Jan 30, 2012, 11:28:02 AM1/30/12

to tup-users

On Jan 27, 9:17 pm, Mike Shal <mar...@gmail.com> wrote:
> On Thu, Jan 26, 2012 at 5:02 AM, Slawomir Czarko
>

Hi,

No, the disk is not encrypted. So maybe it's 2).

-Slawomir

Mike Shal

unread,

Jan 30, 2012, 7:40:10 PM1/30/12

to tup-...@googlegroups.com

Ok, I'll have to try profiling and see if there's anything obvious
that can be improved. Might be some time though, I'm a bit swamped at
the moment...

-Mike

Slawomir Czarko

unread,

Feb 2, 2012, 3:20:56 AM2/2/12

to tup-users

I can do profiling. I can use cachegrind or oprofile.

-Slawomir

Slawomir Czarko

unread,

Apr 25, 2012, 8:11:45 AM4/25/12

to tup-...@googlegroups.com

Still seeing this. I switched from ext4 to ext3 and it reduced load from tup from 100% of one core to about 70%. tup becomes a bottleneck in the build - on a 6 core machines I'm running 9 parallel builds but it looks like only 50% of cores are being utilized. If I increase number of parallel jobs the load from tup goes up as well. Compile commands execute at a rate of about 1 per second so it shouldn't be cause such a massive load in tup.

Any idea how to troubleshoot/debug/profile this?

-Slawomir

Mike Shal

unread,

Apr 25, 2012, 9:43:03 PM4/25/12

to tup-...@googlegroups.com

As a quick test can you try to replace "fuse_loop(fs.fuse)" with
"fuse_loop_mt(fs.fuse)" in src/tup/server/fuse_server.c ? This will
run fuse in multi-threaded mode. It runs single-threaded by default to
avoid reports from helgrind against libfuse, but I could work around
that with a suppression file if necessary. If that doesn't help at all
I guess we'd need to benchmark further - maybe compare against a fuse
example file-system to see how much of an effect just using fuse has.

Thanks,
-Mike

Anatol Pomozov

unread,

Apr 25, 2012, 9:57:33 PM4/25/12

to tup-...@googlegroups.com

Hi

On Wed, Apr 25, 2012 at 6:43 PM, Mike Shal <mar...@gmail.com> wrote:

On Wed, Apr 25, 2012 at 8:11 AM, Slawomir Czarko
<slawomi...@gmail.com> wrote:
> Still seeing this. I switched from ext4 to ext3 and it reduced load from tup
> from 100% of one core to about 70%. tup becomes a bottleneck in the build -
> on a 6 core machines I'm running 9 parallel builds but it looks like only
> 50% of cores are being utilized.

Remember that tools like top and htop do not

If I increase number of parallel jobs the
> load from tup goes up as well. Compile commands execute at a rate of about 1
> per second so it shouldn't be cause such a massive load in tup.
>
> Any idea how to troubleshoot/debug/profile this?

As a quick test can you try to replace "fuse_loop(fs.fuse)" with
"fuse_loop_mt(fs.fuse)" in src/tup/server/fuse_server.c ? This will
run fuse in multi-threaded mode. It runs single-threaded by default to
avoid reports from helgrind against libfuse, but I could work around
that with a suppression file if necessary. If that doesn't help at all
I guess we'd need to benchmark further - maybe compare against a fuse
example file-system to see how much of an effect just using fuse has.

Thanks,
-Mike

--
tup-users mailing list
email: tup-...@googlegroups.com
unsubscribe: tup-users+...@googlegroups.com
options: http://groups.google.com/group/tup-users?hl=en

Anatol Pomozov

unread,

Apr 25, 2012, 10:06:33 PM4/25/12

to tup-...@googlegroups.com

Hi

Sorry accidentally pressed send button.

On Wed, Apr 25, 2012 at 6:43 PM, Mike Shal <mar...@gmail.com> wrote:

On Wed, Apr 25, 2012 at 8:11 AM, Slawomir Czarko
<slawomi...@gmail.com> wrote:
> Still seeing this. I switched from ext4 to ext3 and it reduced load from tup
> from 100% of one core to about 70%. tup becomes a bottleneck in the build -
> on a 6 core machines I'm running 9 parallel builds but it looks like only
> 50% of cores are being utilized.

top and htop tools have a known issue is that it reports resources on proceses that active at the beginning and the end of measurement interval. So short living commands are not taken into account. You might want to try atop tool - it claims it does not have such issue. http://www.atoptool.nl/

The problem can be also in disk performance.

If I increase number of parallel jobs the
> load from tup goes up as well. Compile commands execute at a rate of about 1
> per second so it shouldn't be cause such a massive load in tup.
>
> Any idea how to troubleshoot/debug/profile this?

As a quick test can you try to replace "fuse_loop(fs.fuse)" with
"fuse_loop_mt(fs.fuse)" in src/tup/server/fuse_server.c ? This will
run fuse in multi-threaded mode. It runs single-threaded by default to
avoid reports from helgrind against libfuse, but I could work around
that with a suppression file if necessary. If that doesn't help at all
I guess we'd need to benchmark further - maybe compare against a fuse
example file-system to see how much of an effect just using fuse has.

I believe the best way to figure out what is going on is to run tup with profiler or some other tool that shows hot functions, hot mutexes (there tup is blocked most of the time), etc...

Mike Shal

unread,

Apr 25, 2012, 11:04:07 PM4/25/12

to tup-...@googlegroups.com

Yeah, I was just curious if multi-threading fuse would help unblock
the idle cores. Profiling would hopefully help point out the problem,
though :)

-Mike

Slawomir Czarko

unread,

Apr 27, 2012, 2:18:17 PM4/27/12

to tup-...@googlegroups.com

On Thursday, April 26, 2012 3:43:03 AM UTC+2, mar...@gmail.com wrote:

On Wed, Apr 25, 2012 at 8:11 AM, Slawomir Czarko

This doesn't help. Actually it makes the build a tiny bit longer (11 minutes vs 10:54 for single threaded fuse).

What I don't understand is that cpu load from tup is higher when there's less commands executed per second and lower when commands are executed faster.

I'm using ccache so when ccache already has all the object files and everything is in filesystem cache the full build takes 22 seconds with CPU load from tup at 55%.

When ccache is clean then the full build takes 10 minutes 54 seconds with CPU load from tup at 88%.

Number of commands executed in both cases is 2387. So that gives 3.65 command / sec => 88% cpu load from tup and 108.5 command / sec => 55% cpu load from tup

-Slawomir

Mike Shal

unread,

Apr 27, 2012, 5:24:01 PM4/27/12

to tup-...@googlegroups.com

This makes some sense (I think?) - if tup is busy doing some work,
maybe it doesn't have a chance to run more processes. I think tup
should definitely scale better wrt. job parallelization than what
you're seeing, so we'll have to benchmark to find out what is holding
it up. (I just have a dual-core machine, so I haven't seen it).

I do have a bit of work to do as far as getting some branches merged
before I'll have a chance to profile - if you want to try to build tup
with profiling options and see if it points to a specific function
that would help move things along. Hopefully it's something simple
like some dumb O(n^2) algorithm lurking around...

Thanks,
-Mike

Slawomir Czarko

unread,

May 3, 2012, 4:35:21 AM5/3/12

to tup-...@googlegroups.com

On Friday, April 27, 2012 11:24:01 PM UTC+2, mar...@gmail.com wrote:

This makes some sense (I think?) - if tup is busy doing some work,
maybe it doesn't have a chance to run more processes. I think tup
should definitely scale better wrt. job parallelization than what
you're seeing, so we'll have to benchmark to find out what is holding
it up. (I just have a dual-core machine, so I haven't seen it).

I do have a bit of work to do as far as getting some branches merged
before I'll have a chance to profile - if you want to try to build tup
with profiling options and see if it points to a specific function
that would help move things along. Hopefully it's something simple
like some dumb O(n^2) algorithm lurking around...

Any specific options I should use for profiling?

-Slawomir

Slawomir Czarko

unread,

May 4, 2012, 8:03:13 AM5/4/12

to tup-...@googlegroups.com

Attached is callgrind output from running tup. There's no obvious hotspots.

-Slawomir

callgrind.out.2872

Slawomir Czarko

unread,

May 4, 2012, 8:06:22 AM5/4/12

to tup-...@googlegroups.com

On Friday, April 27, 2012 11:24:01 PM UTC+2, mar...@gmail.com wrote:

On Fri, Apr 27, 2012 at 2:18 PM, Slawomir Czarko

I'm not sure I understand.

The way I see it in both cases tup is doing exactly the same work as commands to execute are the same. In one case (with ccache cold) commands take longer to execute and in the other (with ccache hot) commands execute faster. That shouldn't affect amount of work done by tup unless it has a busy loop somewhere waiting for child processes to complete.

-Slawomir

Mike Shal

unread,

May 8, 2012, 4:43:44 PM5/8/12

to tup-...@googlegroups.com

Thanks for the callgrind output - unfortunately as you say there's
nothing too obvious. The database takes up a lot of time, as does
managing trees of tup nodes, but that is to be expected since that's
most of the work. There are almost certainly improvements that can be
done in both of those cases, but I don't know how significant of an
increase it would be. I still find it strange that it can't keep each
core busy. Maybe other benchmarking methods would paint a different
picture? Perhaps something that monitors IO as well? I'm not really
sure where to go from here...

Thanks,
-Mike

Slawomir Czarko

unread,

May 14, 2012, 8:33:09 AM5/14/12

to tup-...@googlegroups.com

On Tuesday, May 8, 2012 10:43:44 PM UTC+2, mar...@gmail.com wrote:

On Fri, May 4, 2012 at 8:06 AM, Slawomir Czarko

iostat shows less than 1% of CPU time is spent in iowait state and I/O during build is less than 4MB/s with less than 100 tps (transfers per second). Disk is SSD.

I even tried putting tup db on a ramdisk but there was no change.

What I do is:

1) tup touch some/code/*

2) tup upd -j9

3) I use this command to measure CPU load from tup plus total CPU time spent by tup:

ps -o '%cpu cputime' `ps aux | grep "tup upd" | grep -v grep | awk '{print $2}'`

I execute it just before tup finishes.

With ccache cold I get:

%CPU TIME

0.0 00:00:00

67.2 00:01:21

0.1 00:00:00

Total build time was 122.7 seconds so these values match more or less.

With ccache warm I get:

%CPU TIME

0.0 00:00:00

58.3 00:00:04

1.8 00:00:00

Total build time was 7.7 seconds so these values match also.

As far as I know the work done by tup is the same regardless if ccache is warm or cold so there should be no difference in the CPU time spent in tup. In both cases tup starts the compiler which then reads the source files, all the includes, runs preprocessor and then either copies already compiled object file from ccache (when it's warm) or runs the actual compilation (when it's cold). ccache files are not monitored by tup since they're outside the tup build tree.

What I see is about 20x more CPU time spent in tup when ccache is cold than when it's warm. Any idea how to explain this?

-Slawomir

Mike Shal

unread,

May 22, 2012, 8:37:15 PM5/22/12

to tup-...@googlegroups.com

I don't exactly know why this is, but it looks when it is cold it
accesses the header files more times than when the cache is warm. I
temporarily added some debug to tup to print out each file access, and
in one case I tried with a warm cache there were 1171 accesses, and
with a cold cache the same compilation had 3807 accesses. Tup has to
process each one, which somewhat explains the extra CPU time (I have a
branch in progress that tries to reduce this by avoiding duplicate
work if the same file is accessed multiple times, but it isn't merged
yet). Though in my tests using the same cpu usage command that you
were, I only saw about a 2x increase in cpu for the cold case vs the
warm case. I have no idea why you're seeing 20x more time...

-Mike

Slawomir Czarko

unread,

May 24, 2012, 4:35:20 PM5/24/12

to tup-...@googlegroups.com

On Wednesday, May 23, 2012 2:37:15 AM UTC+2, mar...@gmail.com wrote:

I don't exactly know why this is, but it looks when it is cold it
accesses the header files more times than when the cache is warm. I
temporarily added some debug to tup to print out each file access, and
in one case I tried with a warm cache there were 1171 accesses, and
with a cold cache the same compilation had 3807 accesses. Tup has to
process each one, which somewhat explains the extra CPU time (I have a
branch in progress that tries to reduce this by avoiding duplicate
work if the same file is accessed multiple times, but it isn't merged
yet). Though in my tests using the same cpu usage command that you
were, I only saw about a 2x increase in cpu for the cold case vs the
warm case. I have no idea why you're seeing 20x more time...

-Mike

Can you attach a source file for which you see this behaviour?

I tried running strace on compilation of a simple cpp file but I couldn't see any difference in the opened header files regardless whether ccache is cold or warm.

- Slawomir

Mike Shal

unread,

May 24, 2012, 9:03:18 PM5/24/12

to tup-...@googlegroups.com

The difference in file accesses that I'm seeing looks like it has to
do with CCACHE_NODIRECT. In the attached test case, if CCACHE_NODIRECT
is not set, then in the uncached version tup reads from 77 files to
compile foo.c while it only reads from 12 files if it is cached. (The
number of files were displayed using the attached patch). Confusingly,
ccache somehow knows not to try to read from a/foo.h, which doesn't
exist. This causes the build to break when a/foo.h is created, since
the command should be executed when a header is created earlier in the
include path.

With CCACHE_NODIRECT exported, it reads from about the same amount of
files (73 and 72, respectively), and the update works correctly.

So, it seems to me that ccache without CCACHE_NODIRECT is broken. Are
you running with it set or not set? Keep in mind to use it within tup
you'll need to have it set in your shell's environment, and also
explicitly do an 'export CCACHE_NODIRECT' in the Tupfile before the
:-rule that uses ccache.

-Mike

t4074-ccache.sh

ok.patch

Slawomir Czarko

unread,

May 25, 2012, 4:37:46 AM5/25/12

to tup-...@googlegroups.com

On Friday, May 25, 2012 3:03:18 AM UTC+2, mar...@gmail.com wrote:

On Thu, May 24, 2012 at 4:35 PM, Slawomir Czarko

I don't use CCACHE_NODIRECT.

I did this:

$ strace ccache gcc -o /tmp/foo.o -c /tmp/foo.c > cold.log 2>&1

$ strace ccache gcc -o /tmp/foo.o -c /tmp/foo.c > hot.log 2>&1

$ grep open cold.log | wc -l

50

$ grep open hot.log | wc -l

42

the extra 8 files opened in cold case are in ccache so they don't matter

$ grep open cold.log | grep include | wc -l

30

$ grep open hot.log | grep include | wc -l

30

The list of header files opened in both cases is exactly the same.

Can you confirm that difference in files being opened doesn't have something to do with tup itself?

- Slawomir

Mike Shal

unread,

May 25, 2012, 11:02:17 AM5/25/12

to tup-...@googlegroups.com

Yes, this is with ccache-3.1.6, just using the command line (not tup) -

$ cat ok-strace.sh
ccache -C
strace -f ccache gcc -c -Ia -Ib bar.c -o bar.o 2>strace-cold.txt
strace -f ccache gcc -c -Ia -Ib bar.c -o bar.o 2>strace-warm.txt

$ cat bar.c
#include "foo.h"

$ ls a
(empty)

$ ls b
foo.h

$ sh ok-strace.sh
Cleared cache

$ grep foo.h strace-cold.txt
read(4, "#include \"foo.h\"\n", 1024) = 17
[pid 30924] read(4, "#include \"foo.h\"\n", 17) = 17
[pid 30924] open("foo.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file
or directory)
[pid 30924] open("a/foo.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such
file or directory)
[pid 30924] open("b/foo.h", O_RDONLY|O_NOCTTY) = 4
stat64("b/foo.h", {st_mode=S_IFREG|0644, st_size=14, ...}) = 0
open("b/foo.h", O_RDONLY) = 4

$ grep foo.h strace-warm.txt
read(4, "#include \"foo.h\"\n", 1024) = 17
stat64("b/foo.h", {st_mode=S_IFREG|0644, st_size=14, ...}) = 0
open("b/foo.h", O_RDONLY) = 5

In the cold case, the sub-process (pid 30924, which is cc1 -E) tries
to read from a/foo.h, so tup knows to add the dependency. In the warm
case, ccache only tries to read from b/foo.h, so the dependency on
a/foo.h goes away. If I export CCACHE_NODIRECT=1 first, then both the
cold and warm cases have the same foo.h accesses (since ccache
executes the pre-processor in both).

-Mike

Slawomir Czarko

unread,

May 25, 2012, 4:07:45 PM5/25/12

to tup-...@googlegroups.com

OK, I get it now.

- Slawomir

Reply all

Reply to author

Forward