I just want to let people know what we're working on at the moment and
perhaps people can help us with pointers to prior art or other things
we've not thought of.
For the Parallel GHC Project, one of the partners we're working with
is using MPI and would like to be able to do performance profiling of
MPI programs run across the whole cluster. So for the last couple of weeks
we have been looking at the issue of multi-node tracing.
Our general strategy is:
* to extend the existing ghc event log infrastructure
* to use event log merging to handle logs from multiple nodes
* to extend threadscope to understand new events
* perhaps later, to translate from ghc events to trace/log formats
used by existing open source parallel profiling/visualisation tools
The ghc event log system is
"a fast, extensible event logging framework in the GHC run-time
system (RTS) to support profiling of GHC run-time events."
We have discussed our ideas for adding new events for multi-node tracing
with the Eden developers. Eden now uses the ghc event format and has
a new trace viewer called EdenTV. EdenTV, like threadscope, presents
a timeline visualisation and is written in Haskell using cairo + gtk2hs.
While the Eden developers have one use-case to concentrate on, we are
aiming to cover a variety of multi-node tracing use-cases including MPI
programs but also classic client/server programs. We have therefore been
trying to think about flexible approaches that let us record a variety
of information related to multi-node programs.
Tracing multi-node information
Eden adds a set of new events to keep track of Eden processes (a
language construct that Eden adds on top of Haskell) and the
relationship between lightweight Haskell threads, Eden processes and
Eden machines. Eden machines are basically RTS instances, so there's one
eventlog per Eden machine and these get merged to provide a view of a
program that runs across multiple machines.
Our idea is to extend the ghc event log system with a concept called a
capability set. A capability is already an important concept in the event
log system.
A capability is a GHC RTS concept; threadscope calls them Haskell
Execution Contexts (HECs). A capability basically corresponds to a CPU core that
runs Haskell code. Most events in the event log are associated with a
capability (in the GHC implementation, each running capability buffers
up the events it generates and occasionally flushes them in a block to
the event log file).
There are events for tracking which Haskell threads belong to which
capabilities (e.g. when threads are created/destroyed/migrated).
So our idea is to extend this system by allowing sets of capabilities to
be identified and to associate information with the sets. For example
the simplest use case would be to make a capability set for all the
capabilities that are running inside a single process / RTS instance and
to label that capability set with the OS process id. Then if we have
event logs for two Haskell programs running on the same machine, we
can merge them and still identify which capabilities belong to which
program / OS process.
The next obvious one is to have a capability set for a physical
machine/host labelled with its network address. Having merged the event
logs, we can then see all capabilities from all machines in one view, but
still distinguish which capabilities belong to which machine.
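As a rough sketch of what the merging step looks like, assuming each
per-machine log is already sorted by timestamp and the timestamps have
been adjusted onto a common timeline (the types here are illustrative
placeholders, not the real ghc-events API):

```haskell
import Data.Word (Word64)

-- Illustrative event type: each event carries a label for the machine
-- (capability set) it came from, so attribution survives the merge.
data TaggedEvent = TaggedEvent
  { teMachine :: String  -- e.g. the host's network address
  , teTimeNs  :: Word64  -- timestamp on a common timeline
  , teDesc    :: String
  } deriving (Eq, Show)

-- Merge two logs, each sorted by timestamp, into one sorted log.
mergeLogs :: [TaggedEvent] -> [TaggedEvent] -> [TaggedEvent]
mergeLogs [] ys = ys
mergeLogs xs [] = xs
mergeLogs (x:xs) (y:ys)
  | teTimeNs x <= teTimeNs y = x : mergeLogs xs (y:ys)
  | otherwise                = y : mergeLogs (x:xs) ys
```

Merging pairwise like this extends to any number of machines by folding
over the list of logs.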
In addition to information associated with each capability set as a
whole, we will allow information to be associated with each member of
the set. One use case for this is to identify MPI groups. Each node in
an MPI group is identified by an index, and communications between
members of the group are labelled with the sender and receiver indexes.
So keeping track of this information would be the first step towards
visualising MPI communications.
For those of you familiar with ghc events, the specific extensions we're
thinking of are:
EVENT_CAPABILITY_SET_CREATE (cap_set, cap_set_type, cap_set_info)
EVENT_CAPABILITY_SET_ASSIGN_MEMBER (cap_set, cap, cap_set_member_info)
EVENT_CAPABILITY_SET_REMOVE_MEMBER (cap_set, cap)
... -- can be extended
Note that a single capability may be a member of multiple sets.
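In Haskell terms, the proposed events might be modelled roughly as
follows. The concrete types, the CapSetType alternatives and the replay
helper are assumptions made for illustration, not the final eventlog
encoding:

```haskell
import Data.Word (Word16, Word32)

type CapSetId = Word32
type CapNo    = Word16

-- What a set represents: an OS process, a physical host, an MPI group, ...
data CapSetType = CapSetOsProcess | CapSetHost | CapSetMpiGroup
  deriving (Eq, Show)

data CapSetEvent
  = CapabilitySetCreate CapSetId CapSetType String
      -- ^ set id, kind of set, and set-wide info such as a pid or address
  | CapabilitySetAssignMember CapSetId CapNo String
      -- ^ per-member info, e.g. the member's MPI rank within the group
  | CapabilitySetRemoveMember CapSetId CapNo
  deriving (Eq, Show)

-- Replay events in order to find which sets a capability currently
-- belongs to; the result can name several sets at once.
memberships :: CapNo -> [CapSetEvent] -> [CapSetId]
memberships cap = foldl step []
  where
    step acc (CapabilitySetAssignMember s c _) | c == cap = acc ++ [s]
    step acc (CapabilitySetRemoveMember s c)   | c == cap = filter (/= s) acc
    step acc _                                            = acc
```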
An important feature for multi-node tracing is getting a proper time
synchronisation between multiple machines. The current ghc event log
only records time relative to the beginning of the OS process. Eden
extends this with a localtime on each Eden machine startup event. If one
assumes that the local clocks are reasonably close (e.g. using NTP) then
one can use this when merging event logs to match up the time between
machines. Unfortunately we cannot directly re-use the time event that
Eden defines because it is tied to the Eden machine concept.
We want to be able to make use of various different sources of
information for time synchronisation. Ideally we can use the best source
given the context. For example in a cluster we may be able to rely on
NTP but in ad-hoc client/server systems we almost certainly cannot trust
the local times to be synchronised.
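For the client/server case, one candidate piece of information to record
is a round-trip clock sample, from which the standard NTP-style offset
estimate can be computed. A minimal sketch, with names and units chosen
here for illustration:

```haskell
import Data.Int (Int64)

-- t0: client send time, t3: client receive time (client clock);
-- t1: server receive time, t2: server send time (server clock).
-- Returns the estimated offset of the server clock relative to the
-- client clock, in the same units as the inputs (say, microseconds).
clockOffset :: Int64 -> Int64 -> Int64 -> Int64 -> Int64
clockOffset t0 t1 t2 t3 = ((t1 - t0) + (t2 - t3)) `div` 2

-- Half the round-trip time less the server processing time: a bound on
-- the error of the offset estimate.
offsetError :: Int64 -> Int64 -> Int64 -> Int64 -> Int64
offsetError t0 t1 t2 t3 = ((t3 - t0) - (t2 - t1)) `div` 2
```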
This is the area where we are most fuzzy at the moment and would
appreciate pointers to prior art. It is clear that we could add an event
to indicate the local time for a capability; assuming that local time is
synchronised with NTP or equivalent, a log merging tool can then match up
times. But for systems where we cannot rely on NTP, what is
the appropriate information that could be recorded in the ghc event log?
The hope is that, across these various use cases, there is some
standard information that we can record in the log to work out how to
match up events from multiple nodes. The idea is that in different use
cases we can obtain the information in different ways but that there is
a common interface for emitting the info into the event log where it can
be used by the merging and visualisation tools. This would take the form
of a Haskell function that emits a time event into the event log. For
example, this would then be used by a client/server application which
might obtain approximate time offset information by sending localtime
timestamps (HTTP sends server time for example). The MPI bindings would
use some other method to obtain the information, e.g. some MPI-level mechanism.
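Concretely, the emitted time event could be a paired sample of elapsed
time and wall-clock time, and the merging tool could use one such sample
per machine to put all the logs on a shared timeline. A sketch under the
assumption that the wall clocks agree (all names here are hypothetical):

```haskell
import Data.Int (Int64)
import Data.Word (Word64)

-- A hypothetical wall-clock sample event: the RTS-relative timestamp at
-- which the sample was taken, paired with wall-clock nanoseconds since
-- the Unix epoch. The shape is an assumption, not a real eventlog format.
data TimeSample = TimeSample
  { elapsedNs :: Word64  -- time since this RTS instance started
  , wallNs    :: Word64  -- wall-clock time at that same instant
  } deriving (Eq, Show)

-- Wall-clock time at which this machine's elapsed clock read zero.
processStart :: TimeSample -> Int64
processStart s = fromIntegral (wallNs s) - fromIntegral (elapsedNs s)

-- Shift to add to machine B's elapsed timestamps so they line up with
-- machine A's, assuming both wall clocks agree (e.g. via NTP).
shiftBtoA :: TimeSample -> TimeSample -> Int64
shiftBtoA a b = processStart b - processStart a
```

For the non-NTP case, the same shift could instead be derived from an
estimated clock offset, so the emitting interface stays the same while
the source of the synchronisation information varies.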
Duncan Coutts, Haskell Consultant
Well-Typed LLP, http://www.well-typed.com/