ghc events for multi-node tracing


Duncan Coutts

Feb 9, 2011, 2:16:26 PM
to parallel...@googlegroups.com
All,

I just want to let people know what we're working on at the moment and
perhaps people can help us with pointers to prior art or other things
we've not thought of.

For the Parallel GHC Project [1], one of the partners we're working with
is using MPI and would like to be able to do performance profiling of
MPI programs run across the whole cluster. So for the last couple weeks
we have been looking at the issue of multi-node tracing.

Our general strategy is:
* to extend the existing ghc event log infrastructure
* to use event log merging to handle logs from multiple
machines/processes
* to extend threadscope to understand new events
* perhaps later, to translate from ghc events to trace/log formats
used by existing open source parallel profiling/visualisation
tools

The ghc event log system [2] is
"a fast, extensible event logging framework in the GHC run-time
system (RTS) to support profiling of GHC run-time events."

We have discussed our ideas for adding new events for multi-node tracing
with the Eden[3] developers. Eden now uses the ghc event format and has
a new trace viewer called EdenTV [4]. EdenTV, like threadscope, presents
a timeline visualisation and is written in Haskell using cairo + gtk2hs.

While the Eden developers have one use-case to concentrate on, we are
aiming to cover a variety of multi-node tracing use-cases including MPI
programs but also classic client/server programs. We have therefore been
trying to think about flexible approaches that let us record a variety
of information related to multi-node programs.


Tracing multi-node information
------------------------------

Eden adds a set of new events to keep track of Eden processes (a
language construct that Eden adds on top of Haskell) and the
relationship between lightweight Haskell threads, Eden processes and
Eden machines. Eden machines are basically RTS instances, so there's one
eventlog per Eden machine and these get merged to provide a view of a
program that runs across multiple machines.

Our idea is to extend the ghc event log system with a concept called a
"capability set".

A capability is already an important concept in the event log system.
The capability is a GHC RTS concept; threadscope calls them Haskell
Execution Contexts (HECs). A capability basically corresponds to a CPU
core that
runs Haskell code. Most events in the event log are associated with a
capability (in the GHC implementation, each running capability buffers
up the events it generates and occasionally flushes them in a block to
the event log file).

There are events for tracking which Haskell threads belong to which
capabilities (e.g. when threads are created/destroyed/migrated).

So our idea is to extend this system by allowing sets of capabilities to
be identified and to associate information with the sets. For example
the simplest use case would be to make a capability set for all the
capabilities that are running inside a single process / RTS instance and
to label that capability set with the OS process id. Then if we have
event logs for two Haskell programs running on the same machine then we
can merge them and still identify which capabilities belong to which
program / OS process.

The next obvious one is to have a capability set for a physical
machine/host labelled with its network address. Having merged event
logs, we can then see all capabilities from all machines in one view, but
still distinguish which capabilities belong to which machine.

In addition to information associated with each capability set as a
whole, we will allow information to be associated with each member of
the set. One use case for this is to identify MPI groups. Each node in
an MPI group is identified by an index, and communications between
members of the group are labelled with the sender and receiver indexes.
So keeping track of this information would be the first step towards
visualising MPI communications.

For those of you familiar with ghc events, the specific extensions we're
thinking of are:

EVENT_CAPABILITY_SET_CREATE (cap_set, cap_set_type, cap_set_info)
EVENT_CAPABILITY_SET_DELETE (cap_set)
EVENT_CAPABILITY_SET_ASSIGN_MEMBER (cap_set, cap, cap_set_member_info)
EVENT_CAPABILITY_SET_REMOVE_MEMBER (cap_set, cap)

enumeration cap_set_type
    = cap_set_type_osprocess
    | cap_set_type_host
    ...    -- can be extended

Note that a single capability may be a member of multiple sets.
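
To make this a little more concrete, here is a rough Haskell rendering
of the proposed events, in the style of the ghc-events library. The
type and field names below are placeholders of ours to illustrate the
shape of the data, not a committed binary format:

  import Data.Word (Word32)

  type CapSetId   = Word32
  type Capability = Int

  data CapSetType
    = CapSetOsProcess   -- caps belonging to one OS process / RTS instance
    | CapSetHost        -- caps running on one physical machine/host
    -- further types, e.g. an MPI group, could be added later
    deriving (Show, Eq)

  data CapSetEvent
    = CapSetCreate       CapSetId CapSetType String
        -- cap_set_info, e.g. the OS process id or the host's network address
    | CapSetDelete       CapSetId
    | CapSetAssignMember CapSetId Capability String
        -- cap_set_member_info, e.g. an MPI rank within the group
    | CapSetRemoveMember CapSetId Capability
    deriving (Show)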


Time synchronisation
--------------------

An important feature for multi-node tracing is getting a proper time
synchronisation between multiple machines. The current ghc event log
only records time relative to the beginning of the OS process. Eden
extends this with a localtime on each Eden machine startup event. If one
assumes that the local clocks are reasonably close (e.g. using NTP) then
one can use this when merging event logs to match up the time between
machines. Unfortunately we cannot directly re-use the time event that
Eden defines because it is tied to the Eden machine concept.

We want to be able to make use of various different sources of
information for time synchronisation. Ideally we can use the best source
given the context. For example in a cluster we may be able to rely on
NTP but in ad-hoc client/server systems we almost certainly cannot trust
the local times to be synchronised.
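
As an aside, in the client/server case one well-known way to get an
approximate offset without NTP is a round-trip estimate in the style of
Cristian's algorithm. A sketch (names and types are ours, purely for
illustration):

  import Data.Int (Int64)

  type Nanos = Int64

  -- Estimate (server clock - client clock), assuming the server read its
  -- clock roughly at the midpoint of the round trip. The error is bounded
  -- by half the round-trip time.
  estimateOffset :: Nanos  -- client time when the request was sent
                 -> Nanos  -- client time when the reply arrived
                 -> Nanos  -- server time reported in the reply
                 -> (Nanos, Nanos)  -- (offset estimate, error bound)
  estimateOffset sent recv serverTime =
    (serverTime - (sent + halfTrip), halfTrip)
    where
      halfTrip = (recv - sent) `div` 2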

This is the area where we are most fuzzy at the moment and would
appreciate pointers to prior art. It is clear that we could add an event
to indicate the local time for a capability, and then, on the assumption
that this is synchronised with NTP or equivalent, a log merging tool
can match up times. But for systems where we cannot rely on NTP, what is
the appropriate information that could be recorded in the ghc event log?

The hope is that, across these various use cases, there is some
standard information that we can record in the log to work out how to
match up events from multiple nodes. The idea is that in different use
cases we can obtain the information in different ways but that there is
a common interface for emitting the info into the event log where it can
be used by the merging and visualisation tools. This would take the form
of a Haskell function that emits a time event into the event log. For
example, a client/server application might obtain approximate time
offset information by exchanging local timestamps (HTTP responses carry
the server's time, for instance), while the MPI bindings would obtain
the information by some other method, e.g. using an MPI global barrier.
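
As a very rough sketch, the Haskell-level interface might look something
like the function below. Neither the function nor the corresponding
event exists yet; the name and fields are ours and entirely up for
discussion:

  import Data.Int  (Int64)
  import Data.Word (Word64)

  -- Record our best estimate of the offset between this process's
  -- eventlog clock and some shared reference clock, together with a
  -- bound on the error of that estimate (both in nanoseconds).
  -- A real implementation would call into the RTS to emit an eventlog
  -- record; this stub only fixes the shape of the interface.
  traceTimeOffset :: Int64   -- estimated (reference time - local time)
                  -> Word64  -- estimated error bound
                  -> IO ()
  traceTimeOffset _offset _error = return ()

  -- e.g. an MPI binding might call this after a global barrier, and a
  -- client/server application after parsing a server-supplied timestamp.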

[1]: http://groups.google.com/group/parallel-haskell/t/1a9f5e5fb83b6f9a
[2]: http://hackage.haskell.org/trac/ghc/wiki/EventLog
[3]: http://www.mathematik.uni-marburg.de/~eden/
[4]: http://www.mathematik.uni-marburg.de/~eden/?content=trace_main&navi=trace

--
Duncan Coutts, Haskell Consultant
Well-Typed LLP, http://www.well-typed.com/

Manuel M T Chakravarty

Feb 9, 2011, 8:00:31 PM
to parallel...@googlegroups.com
Hi Duncan,

> For the Parallel GHC Project [1], one of the partners we're working with
> is using MPI and would like to be able to do performance profiling of
> MPI programs run across the whole cluster. So for the last couple weeks
> we have been looking at the issue of multi-node tracing.
>
> Our general strategy is:
> * to extend the existing ghc event log infrastructure
> * to use event log merging to handle logs from multiple
> machines/processes
> * to extend threadscope to understand new events
> * perhaps later, to translate from ghc events to trace/log formats
> used by existing open source parallel profiling/visualisation
> tools

Please maintain the integration of the eventlog probes with the dtrace probes. The current setup is quite modular and was developed in discussions with SimonM.

Do you need to change anything for that? Isn't it sufficient to maintain a separate structure assigning capabilities to capability sets? Given the capability information that is already in the events, you can identify the capability set an event belongs to. In other words, I don't think the "capability set" information needs to go into the events.

> The next obvious one is to have a capability set for a physical
> machine/host labelled with its network address. Having merged event
> logs, we can then see all capabilities from all machines in one view, but
> still distinguish which capabilities belong to which machine.

Again, does that require any change? Capabilities won't migrate in the current RTS, and even threads can only migrate to HECs using the same shared memory. Different physical nodes (with separate network adapters) will always have separate RTS instances. So, what you really want is to make sure you know which RTS instance created an event, and then, for each RTS instance, you want to know the physical node on which it ran; there is no need to track that information on an event-by-event basis.

I think this layering (i.e., having physical node and capability set information independent of the actual event data) is important for two reasons:

* The size of your data stream is already growing with the number of parallel threads that you are tracing. You don't want to produce even more data by having redundant information in the individual events.

* This is all rather experimental and easier to change if the information is layered.

> Time synchronisation
> --------------------
>
> An important feature for multi-node tracing is getting a proper time
> synchronisation between multiple machines. The current ghc event log
> only records time relative to the beginning of the OS process. Eden
> extends this with a localtime on each Eden machine startup event. If one
> assumes that the local clocks are reasonably close (e.g. using NTP) then
> one can use this when merging event logs to match up the time between
> machines. Unfortunately we cannot directly re-use the time event that
> Eden defines because it is tied to the Eden machine concept.
>
> We want to be able to make use of various different sources of
> information for time synchronisation. Ideally we can use the best source
> given the context. For example in a cluster we may be able to rely on
> NTP but in ad-hoc client/server systems we almost certainly cannot trust
> the local times to be synchronised.
>
> This is the area where we are most fuzzy at the moment and would
> appreciate pointers to prior art. It is clear that we could add an event
> to indicate the local time for a capability, and then on the assumption
> that this is synchronised with NTP or equivalent then a log merging tool
> can match up times. But for systems where we cannot rely on NTP, what is
> the appropriate information that could be recorded in the ghc event log?

I am sceptical about relying on physical clocks. In particular, they might work in some setups, but will get increasingly problematic with growing networks, i.e. scalability is a worry. This is of course no new problem, and there is quite a bit of literature on the topic. A standard approach is to use the logical clocks introduced by Leslie Lamport:

http://en.wikipedia.org/wiki/Lamport_timestamps

and their various extensions, e.g.,

http://en.wikipedia.org/wiki/Vector_clocks

(Google Scholar will give you many papers on the topic.)

Vector clocks are used, for example, in the Riak data store:

http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/
http://blog.basho.com/2010/04/05/why-vector-clocks-are-hard/

Manuel

Duncan Coutts

Feb 10, 2011, 5:54:30 AM
to parallel...@googlegroups.com
On Thu, 2011-02-10 at 12:00 +1100, Manuel M T Chakravarty wrote:
> Hi Duncan,

> Please maintain the integration of the eventlog probes with the dtrace
> probes. The current setup is quite modular and developed in
> discussions with SimonM.

Ok, will keep that in mind. In some cases the information we want in the
log is not strictly an event. I'll confer with Simon on the details.

> > Tracing multi-node information
> > ------------------------------

> > So our idea is to extend this system by allowing sets of capabilities to
> > be identified and to associate information with the sets. For example
> > the simplest use case would be to make a capability set for all the
> > capabilities that are running inside a single process / RTS instance and
> > to label that capability set with the OS process id. Then if we have
> > event logs for two Haskell programs running on the same machine then we
> > can merge them and still identify which capabilities belong to which
> > program / OS process.
>
> Do you need to change anything for that? Isn't it sufficient to
> maintain a separate structure assigning capabilities to capability
> sets? Given the capability information that is in the events already,
> you can already identify the capability set an event belongs to. In
> other words, I don't think the "capability set" information needs to
> go into the events.

That's right. We're not changing any existing events. All events are
already associated with a capability so we just need to add enough
events so that programs processing the eventlogs can maintain a cap ->
[cap_set] mapping.
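
For example, a log consumer could maintain that mapping with a simple
fold over the events, something like the following (the types are the
illustrative ones from my first mail, simplified, not anything that
exists in ghc-events today):

  import           Data.Map  (Map)
  import qualified Data.Map  as Map
  import           Data.Word (Word32)

  type Capability = Int
  type CapSetId   = Word32

  data CapSetEvent
    = CapSetCreate       CapSetId            -- type/info fields omitted
    | CapSetDelete       CapSetId
    | CapSetAssignMember CapSetId Capability
    | CapSetRemoveMember CapSetId Capability

  -- Fold one event into the cap -> [cap_set] mapping.
  step :: Map Capability [CapSetId] -> CapSetEvent -> Map Capability [CapSetId]
  step m (CapSetAssignMember s c) = Map.insertWith (++) c [s] m
  step m (CapSetRemoveMember s c) = Map.adjust (filter (/= s)) c m
  step m (CapSetDelete s)         = Map.map (filter (/= s)) m
  step m (CapSetCreate _)         = m

  -- Which sets a capability currently belongs to.
  capSets :: Map Capability [CapSetId] -> Capability -> [CapSetId]
  capSets m c = Map.findWithDefault [] c m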

> > The next obvious one is to have a capability set for a physical
> > machine/host labelled with its network address. Having merged event
> > logs, we can then see all capabilities from all machines in one view, but
> > still distinguish which capabilities belong to which machine.
>
> Again, does that require any change?

No changes to existing events are required.

> Capabilities won't migrate in the current RTS and even threads can
> only migrate to HECs using the same shared memory. Different physical
> nodes (with separate network adapters) will always have separate RTS
> instances. So, what you really want is to make sure you know which
> RTS instance created an event, and then, for each RTS instance,
> you want to know the physical node on which it ran, but there is no
> need to track that information on an event by event basis.

Right.

> I think this layering (ie, having physical node and capability set
> information independent of the actual event data) is important for two
> reasons:
>
> * The size of your data stream is already growing with the number of
> parallel threads that you are tracing. You don't want to produce even
> more data by having redundant information in the individual events.
>
> * This is all rather experimental and easier to change if the
> information is layered.

I agree, that's exactly what we're proposing. Sorry, I should have put
our actual proposed new events earlier on so you'd see more precisely
what we're saying:

> For those of you familiar with ghc events, the specific extensions
> we're thinking of are:
>
> EVENT_CAPABILITY_SET_CREATE (cap_set, cap_set_type, cap_set_info)
> EVENT_CAPABILITY_SET_DELETE (cap_set)
> EVENT_CAPABILITY_SET_ASSIGN_MEMBER (cap_set, cap, cap_set_member_info)
> EVENT_CAPABILITY_SET_REMOVE_MEMBER (cap_set, cap)
>
> enumeration cap_set_type
> = cap_set_type_osprocess
> | cap_set_type_host
> ... -- can be extended

That's it: four events so that the cap -> [cap_set] mapping can be
described (and so that it can change over the course of a program run,
which may be important for some types of cap set). In the usual cases
that means very few extra events, so minimal extra space use. No
existing events are changed, and existing agents can safely ignore the
extra information.


> > Time synchronisation
> > --------------------

> I am sceptical about relying on physical clocks.

Me too! My vague thoughts are that we will have two ways of describing
time in the event logs:
1. local time
2. something else more sophisticated
Applications should provide the best information they have access to,
and eventlog processors should make use of whichever is provided in the
log.

I suspect that in many setups we don't have much more information than
local times. We will in any case want to provide this as a
mechanism for the use case of multiple processes on a single machine,
when the local clock is reliable.
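
As a sketch of what a merging tool would do with local times (the
record layout and names here are ours, just to illustrate the
arithmetic):

  import Data.List (sortBy)
  import Data.Ord  (comparing)
  import Data.Word (Word64)

  type Nanos = Word64

  data Log = Log
    { logStartWallClock :: Nanos             -- wall clock at process start
    , logEvents         :: [(Nanos, String)] -- (time since start, event)
    }

  -- Put every event onto the shared wall-clock axis and merge by time.
  -- This is only sound to the extent that the local clocks agree.
  mergeLogs :: [Log] -> [(Nanos, String)]
  mergeLogs = sortBy (comparing fst) . concatMap absolute
    where
      absolute (Log start evs) = [ (start + t, e) | (t, e) <- evs ]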

> In particular, they might work in some set ups, but will get
> increasingly problematic with growing networks — i.e., scalability is
> a worry. This is of course no new problem, and there is quite a bit
> of literature on the topic. A standard approach are the logical
> clocks introduced by Leslie Lamport
>
> http://en.wikipedia.org/wiki/Lamport_timestamps
>
> and their various extensions, eg,
>
> http://en.wikipedia.org/wiki/Vector_clocks
>
> (Google Scholar will give you many papers on the topic.)
>
> Vector clocks are used, for example, in the Riak data store:
>
> http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/
> http://blog.basho.com/2010/04/05/why-vector-clocks-are-hard/

Thanks for the pointers. I am aware of Lamport timestamps and vector
clocks, though I don't yet fully understand them! In particular I
understand that they tell us about causality (the happened-before relation)
but I'm not sure how (or if) one can use that information to try to
align/adjust event streams, e.g. in threadscope.
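
For reference, the core of Lamport's scheme is tiny; this is just the
textbook update rules, not anything we are proposing to add to the
eventlog:

  import Data.Word (Word64)

  type LamportClock = Word64

  -- Every local event ticks the clock.
  localEvent :: LamportClock -> LamportClock
  localEvent c = c + 1

  -- A send ticks the clock and attaches the new value to the message.
  sendMessage :: LamportClock -> (LamportClock, LamportClock)
  sendMessage c = let c' = c + 1 in (c', c')

  -- A receive takes the max of the local clock and the message's
  -- timestamp, then ticks.
  receiveMessage :: LamportClock -> LamportClock -> LamportClock
  receiveMessage c msgStamp = max c msgStamp + 1

This gives an ordering consistent with causality, but not anything that
lines up with wall-clock durations, which is part of why it's not
obvious how to use it to lay out a timeline.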

The other issue is that systems need to be explicitly designed to
include Lamport timestamps or vector clocks, so many setups where
people want to do tracing/profiling will not be able to provide that
information. For example, the protocol they're using may preclude
adding it easily.

Duncan

Simon Marlow

Feb 10, 2011, 7:20:17 AM
to parallel...@googlegroups.com
> -----Original Message-----
> From: parallel...@googlegroups.com [mailto:parallel-
> has...@googlegroups.com] On Behalf Of Duncan Coutts
> Sent: 10 February 2011 10:55
> To: parallel...@googlegroups.com
> Subject: Re: ghc events for multi-node tracing
>
> On Thu, 2011-02-10 at 12:00 +1100, Manuel M T Chakravarty wrote:
> > Hi Duncan,
>
> > Please maintain the integration of the eventlog probes with the dtrace
> > probes. The current setup is quite modular and developed in
> > discussions with SimonM.
>
> Ok, will keep that in mind. In some cases the information we want in the
> log is not strictly an event. I'll confer with Simon on the details.

I think it's fine for these events to belong to the binary event log only. There are other events of this kind already (EVENT_STARTUP, EVENT_SHUTDOWN, EVENT_BLOCK_MARKER). I imagine that capability sets are probably not relevant to dtrace.

If dtrace users would find it useful to have these events too, then they could be lifted to the Trace framework (rts/Trace.{h,c}).

The layering is quite straightforward; it looks like this:

            RTS
             |
           Trace
          /  |  \
         /   |   \
        /    |    \
    dtrace stdio EventLog

So the RTS talks to the Trace API to generate events, and Trace talks to either dtrace, stdio, or EventLog. Well, it's not quite that simple - dtrace is *always* enabled on platforms that support it, and you also get either EventLog or stdio depending on the RTS flags (-l or -v).

Cheers,
Simon

Manuel M T Chakravarty

Feb 11, 2011, 7:07:56 AM
to parallel...@googlegroups.com
Yes, the plan was that some events/information may be exclusive to one or more frameworks. That's perfectly reasonable. I was just concerned about the overall setup.

Manuel

