Tools for Profiling Build Clusters to Aid in Performance Tuning

Tim Black

Jan 16, 2020, 12:08:41 PM
to icecream-users
I have gleaned some best practices for setting up icecc clusters and tuning their performance from this group and from the issues on the icecc GitHub. To make these tuning decisions, one needs real information about what happened during a distributed build scenario.

We've set up a small icecc build cluster, are having trouble identifying bottlenecks, and are looking for suggestions for how to systematically collect build profiling information to help with the tuning effort. We currently have:
  1. -vvv and -l args set in the daemons, and access to the verbose console output from the machine initiating the build, as well as the iceccd logs from all machines.
  2. Real-time stats collected from the scheduler and provided by the icemon and (my preferred) icecream-sundae monitors.
  3. An understanding that restarting the scheduler(s) will reset the stats displayed by these tools.
While these do provide useful information for post-analysis and real-time monitoring, I'm looking for tools for measuring all of the details of a build scenario. We can post-process the console output to collect per-compile job information, but I'm also interested in gathering snapshots of the cluster state before, during and after a build scenario. I would use these snapshots as my barometer while tuning the system parameters. Looks like icemon and icecream-sundae don't have the ability to take "one shot" snapshots of the metrics that they're displaying.
 
I'm coming from distcc, which had a little-known feature that would serve build cluster statistics on a TCP port, by enabling these daemon options:

--stats
Turn on the statistics HTTP server. By default it is off. (Daemon mode only.)
--stats-port PORT
Set the TCP port to listen on for HTTP requests, rather than the default of 3633. (Daemon mode only.)
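
For comparison, a one-shot fetch of those distcc statistics could look roughly like the Python sketch below. It is only an illustration: the function name fetch_distccd_stats is hypothetical, the request path "/" is an assumption, and the only details taken from the options above are that the server speaks HTTP and defaults to port 3633.

```python
from urllib.request import ProxyHandler, build_opener


def fetch_distccd_stats(host="localhost", port=3633, timeout=5.0):
    """Return the text served by distccd's statistics HTTP server.

    Assumes distccd was started with --stats and that the statistics are
    served at the root path "/" (an assumption, not taken from the thread).
    """
    # Build-cluster hosts are normally on the local network, so bypass any
    # HTTP proxy configured through environment variables.
    opener = build_opener(ProxyHandler({}))
    url = f"http://{host}:{port}/"
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Saving the returned text to a timestamped file before and after a build gives a simple pair of snapshots to compare.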

Is there a designed-in way to collect this info with these (or other) tools? I am now inspecting the source code of icecream-sundae to learn more about collecting stats from schedulers and/or iceccd. Thanks.

Lubos Lunak

Jan 16, 2020, 6:02:04 PM
to icecrea...@googlegroups.com
On Thursday 16 of January 2020, Tim Black wrote:
> I have gleaned some best practices for setting up icecc clusters and tuning
> their performance, from this group, and from the issues on icecc github. To
> make these tuning decisions, one needs real information about what happened
> during a distributed build scenario.
>
> We've set up a small icecc build cluster, are having trouble identifying
> bottlenecks, and are looking for suggestions for how to systematically
> collect build profiling information to help with the tuning effort. We
> currently have:
>
> 1. -vvv and -l args set in the daemons, and have access to the verbose
> console output from the machine initiating the build, as well as the
> iceccd logs from all machines.
> 2. real-time stats collected from the scheduler and provided by icemon
> and (my preferred) icecream-sundae monitors.
> 3. I understand that restarting scheduler(s) will reset the stats that
> are displayed by these tools
>
> While these do provide useful information for post-analysis and real-time
> monitoring, I'm looking for tools for measuring all of the details of a
> build scenario. We can post-process the console output to collect
> per-compile job information, but I'm also interested in gathering snapshots
> of the cluster state before, during and after a build scenario.

Try "telnet <scheduler> 8766". Something like this could get you a snapshot
of some data:

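expect << EOF
spawn telnet localhost 8766
expect "200 Use 'help' for help and 'quit' to quit."
send "listcs\r"
expect "200 done"
send "quit\r"
# wait for telnet to exit; the script is read from stdin, so there is no terminal to interact with
expect eof
EOF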

Together with logs and icemon's various views this generally gives me all the
info I need, so if you need more, please be more specific.
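
For scripted snapshots without expect or a telnet client, the same exchange can be sketched over a plain TCP socket in Python. This is a minimal sketch, not part of icecream: the function name scheduler_snapshot is hypothetical, and it assumes the scheduler ends each line with a newline, accepts commands terminated by "\r\n", and finishes every reply with "200 done" as in the listcs example above.

```python
import socket

GREETING = "200 Use 'help'"  # start of the greeting line quoted in the expect script
DONE = "200 done"            # end-of-reply marker quoted in the expect script


def _read_until(reader, marker):
    """Collect decoded lines until one starts with `marker`; the marker line is dropped."""
    lines = []
    for raw in reader:
        line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        if line.startswith(marker):
            return lines
        lines.append(line)
    raise ConnectionError(f"connection closed before {marker!r} was seen")


def scheduler_snapshot(host="localhost", port=8766, command="listcs", timeout=5.0):
    """Send one command to the scheduler's telnet port and return its reply lines."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rb") as reader:
            _read_until(reader, GREETING)             # skip the greeting
            sock.sendall(command.encode("utf-8") + b"\r\n")
            reply = _read_until(reader, DONE)         # everything before "200 done"
            sock.sendall(b"quit\r\n")                 # close the session politely
    return reply
```

Calling scheduler_snapshot("my-scheduler") before, during and after a build and writing the returned lines to timestamped files would give comparable snapshots of the cluster state; "my-scheduler" is a placeholder host name.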

--
Lubos Lunak

Tim Black

Jan 17, 2020, 6:28:53 PM
to icecream-users
Thanks so much for this. The word telnet appears once in the icecream readme, in reference only to the port numbers the apps use. I will try your "listcs" command, but do you know of any documentation of the telnet interface (other than the source code, of course)?

Lubos Lunak

Jan 18, 2020, 4:38:36 AM
to icecrea...@googlegroups.com
There's no documentation, but it should be fairly trivial and
self-explanatory. In case you are not familiar with the "expect" syntax, the
example I pasted is the equivalent of manually running "telnet localhost
8766" and then typing:
listcs
quit

Just telnet to your scheduler machine this way, type "help" and then try what
each of the commands does.

--
Lubos Lunak

Tim Black

Jan 18, 2020, 11:20:57 AM
to icecream-users
Thanks Lubos. I agree it's mostly self-explanatory once you're aware this interface exists and are using it; I just feel that the documentation should mention that it exists and what it can be used for. Maybe I'll submit a PR with some suggested doc changes.

About the details of telnet interface, I would like to see a description of what each command returns. 'help' tells me what commands there are, sure, but I would think any person relatively new to icecc would have the questions "what's a block?", "what's a kid?", "what does 'cs' stand for?", "are these stats shared between all schedulers on the network, or does only the active scheduler know them?"...

JPEWDev forwarded to me a presentation he made last summer which included a really good high-level explanation of how daemons and schedulers and client shims work together and can be configured. (https://youtu.be/VpK27pI64jQ) I would love to get any more references that anyone has found useful in ramping up quickly on the details/internals of how icecc works. 

On Monday (PST), I'll resume troubleshooting our icecc cluster, using the telnet interface to glean details and trying to find out why I'm not seeing any performance improvement for a fairly highly parallelized C++ project using meson/ninja.

Lubos Lunak

Jan 31, 2020, 4:00:14 AM
to icecrea...@googlegroups.com

Hello,

I apologize for the delayed answer.

On Saturday 18 of January 2020, Tim Black wrote:
> Thanks Lubos. I agree it's *mostly* self-explanatory once you're aware this
> interface exists and are using it, I just feel that it should be mentioned
> in the documentation that it exists and what it can be used for. Maybe I'll
> submit a PR with some suggested doc changes.

That would be welcome.

> About the details of telnet interface, I would like to see a description of
> what each command returns. 'help' tells me what commands there are, sure,
> but I would think any person relatively new to icecc would have the
> questions "what's a block?", "what's a kid?", "what does 'cs' stand for?",
> "are these stats shared between all schedulers on the network, or does only
> the active scheduler know them?"...

I think a person relatively new to icecc doesn't actually need to know these.
In fact, most probably don't; usually it should be enough to just set it up
and use it. It didn't occur to me to ask at first, but why do you actually need
to know these things? I don't think there's much you can do to optimize
icecc, except for changing the code itself. The two primary things that
AFAICT decide performance are performance of icecc itself and how the
scheduler decides to distribute the builds.

That said:

kid = job = one compilation process run on a node
cs = compile server = icecc node
block = blacklist (presumably you mean the 'blockcs' command)
stats sharing - there is only one active scheduler on a network (unless you
intentionally split it by using different icecc netnames), others are
suspended

--
Lubos Lunak