Design notes and request for feedback on cloud haskell transport layer interface

Duncan Coutts

Nov 2, 2011, 3:32:21 PM
to parallel...@googlegroups.com
All,

This is a note and request for feedback about using cloud Haskell with
high performance cluster interconnects.

A couple weeks ago a small group of people interested in cloud Haskell
and HPC met at MSR in Cambridge to think about how we might adjust the
cloud Haskell implementation (currently the 'remote' package) to work on
fast cluster interconnects. This note is to summarise our discussion and
present some (somewhat) concrete design ideas for further general
discussion. We'd be very happy to get input and opinions from other
interested people.

The approach we want to take is to abstract over a network transport
layer. That is, we want to define an interface between the upper layers
of the remote library (Process monad and all that) and the network
transport layer. We'd like to keep the upper layers essentially the same
and be able to switch transport layers.

There are some obvious transports:
* ordinary IP sockets with tcp
* some HPC interconnect backend (e.g. MPI, or Peter Braam is keen
on using the CCI library [1]).

And some other plausible ones:
* in-process dummy/test, e.g. MVar
* ZeroMQ sockets
* IP sockets with udp
* multi-process+pipes (e.g. fork one process per core)
* multi-node based on SSH (putting a master node in control of
contacting nodes via ssh and running the same binary on each.
This would have low networking performance but should be pretty
easy to deploy & use.)

[1]: http://www.olcf.ornl.gov/center-projects/common-communication-interface/

Additionally, if we can come up with a useful transport layer interface
and a couple good implementations then these might be useful for other
distributed communication libraries in Haskell (authors of such libs can
then choose between building on top of the remote package, or building
on top of the lower level network interface).

As a side note, this transport layer might be a good place to add
encryption or authentication (and the configuration thereof) for use in
insecure IP networks.


Transport layer interface
=========================

So then the question is: what should a transport layer interface look
like?

The basic thing we want is point-to-point send & receive of variable
size blobs of bytes. We will put off consideration of other services
like broadcast.

We need some notion of network endpoint and/or address. These will of
course have to be abstract since the details will vary between network
backends. The endpoints are used for sending and receiving messages:

send :: SendEnd -> ByteString -> IO ()
receive :: ReceiveEnd -> IO ByteString

For reasons we will detail shortly, we follow the cloud Haskell design
of separating the types of the send and receive endpoints.

Send endpoint/address serialisation
-----------------------------------

We want to be able to copy a send endpoint anywhere in the network and
use it to send messages to the receive endpoint. This requires that send
endpoints be serialisable. However, most network endpoints are
associated with some resources, and are not simply an address as a piece
of data.

Thus a critical design decision is whether deserialisation of a send
endpoint is pure or if it is in IO. If it is in IO then a SendEnd can be
a stateful object (e.g. could hold an IP socket). On the other hand if
deserialisation is pure then a SendEnd can only be a piece of data and
it would be 'send' that would have to do all the work of initialising
any stateful network connections.

It is tempting to require pure deserialisation of SendEnds. This would
allow us to use a simple interface for creating connections:

newConn :: IO (SendEnd, ReceiveEnd)

Unfortunately this would impose a certain design and a certain overhead
on implementations. Consider what 'send' would have to do given a
SendEnd. For a connection based network interface like TCP it would need
to take the address contained in the SendEnd and look up in some cache
to see if there is an existing open connection to the given address. If
so it would send, otherwise it would establish a new connection and
send. For a high performance low latency network interconnect this cache
lookup might be an unacceptable overhead. Note also that the cost of
performing a send is not obvious: sometimes it will be quick and other
times it will have to establish a new connection.

Note that this connection cache lookup is exactly how the current Cloud
Haskell implementation works. The cache lookup design is a consequence
of having pure ProcessId and SendPort deserialisation in Cloud Haskell.
So while this overhead will exist in Cloud Haskell given the current
Cloud Haskell design, we have a choice about whether we push the cache
lookup into the transport layer, or do it above the transport layer in
the Cloud Haskell implementation.

I think the better design is for the transport layer to be as low
overhead as reasonable, with a reasonably clear cost model, and that if
we want to keep pure deserialisation for Cloud Haskell that Cloud
Haskell should pay that cost itself.

Thus my suggestion is that we have:

newConn :: IO (SendAddr, ReceiveEnd)
connect :: SendAddr -> IO SendEnd

plus pure serialisation and deserialisation for SendAddr.
(BTW, the exact names here are all open for suggestions)

This now allows implementations where both the SendEnd and ReceiveEnd
can be direct references to stateful objects (e.g. IP sockets) which in
turn allows low overhead implementations of 'send'.
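
To make the trade-off concrete, here is a minimal sketch (purely illustrative,
not part of the proposal) of the connection cache that Cloud Haskell itself
could keep above the transport layer in order to retain pure deserialisation of
its own identifiers. It assumes the connect/send operations above and a pure
serialiseSendAddr :: SendAddr -> ByteString to use as the cache key:

import Control.Concurrent.MVar (MVar, modifyMVar)
import qualified Data.Map as Map
import Data.ByteString (ByteString)

-- cache from serialised addresses to live send ends, owned by the
-- Cloud Haskell layer rather than by the transport
type ConnCache = MVar (Map.Map ByteString SendEnd)

-- reuse an existing SendEnd for this address, connecting on first use
-- (a real version would avoid holding the MVar while connecting)
cachedSend :: ConnCache -> SendAddr -> ByteString -> IO ()
cachedSend cache addr msg = do
  end <- modifyMVar cache $ \m ->
    case Map.lookup (serialiseSendAddr addr) m of
      Just end -> return (m, end)
      Nothing  -> do
        end <- connect addr
        return (Map.insert (serialiseSendAddr addr) end m, end)
  send end msg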


Blocking / non-blocking and connection properties
-------------------------------------------------

Now for sending or receiving messages, one important design decision is
how it interacts with Haskell lightweight threads. Almost all
implementations are going to consist of a Haskell-thread blocking layer
built on top of a non-blocking layer. We can choose to put the transport
interface at the blocking or non-blocking layer. After a bit of
discussion we decided to go for a design that is blocking at the Haskell
thread level. This makes the backend responsible for mapping blocking
calls into something non-blocking at the OS thread level. That is, the
backend must ensure that a blocking send/receive only blocks the Haskell
thread, not all threads on that core. In the general situation we
anticipate having many Haskell threads blocked on network IO while other
Haskell threads continue doing computation. (In an IP backend we get
this property for free because the existing network library implements
the blocking behaviour using the IO manager.)

So the basic send/receive interface would look like:

send :: SendEnd -> ByteString -> IO ()
receive :: ReceiveEnd -> IO ByteString
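
As a tiny illustration of the intended behaviour (a sketch against the proposed
API, not an addition to it): because receive blocks only the calling Haskell
thread, a receiver can simply be forked off while the rest of the program keeps
computing.

import Control.Concurrent (forkIO)
import Control.Monad (forever)

forkReceiver :: ReceiveEnd -> (ByteString -> IO ()) -> IO ()
forkReceiver rend handler = do
  -- only this lightweight Haskell thread blocks in receive;
  -- other Haskell threads keep running undisturbed
  _ <- forkIO (forever (receive rend >>= handler))
  return ()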

Although we have said that this is a blocking interface, we have not
specified the behaviour of send/receive in any detail. I think we want
reliable ordered delivery of arbitrary sized messages.

More specifically:
  * message/datagram (rather than stream oriented)
  * arbitrary message size
  * messages delivered in-order
        For messages sent on the same SendEnd (but not
        necessarily SendAddr).
  * reliable
        to the degree that messages are delivered at most once
        and subsequent messages are not delivered until earlier
        ones are
  * somewhat asynchronous send permitted:
      * send is not synchronous, send completing does not imply
        successful delivery
      * send side buffering is permitted but not required
      * receive side buffering is permitted but not required
      * send may block (e.g. if too much data is in flight or
        destination buffers are full)
  * mismatched send/receive permitted
        It is not an error to send without a thread at the other
        end already waiting in receive (but it may block).

These properties are based on what we can get with (or build on top of)
tcp/ip, udp/ip, unix IPC, MPI and the CCI HPC transport. In particular
CCI emphasises the property that a node should be able to operate with
receive buffer size that is independent of the number of
connections/nodes it communicates with (unlike tcp/ip which has a buffer
per connection). Also, CCI allows unexpected receipt of small messages
but requires pre-arrangement for large transfers (so the receive side
can prepare buffers).

This is a pretty general purpose connection type, supporting as it does
reliable ordered delivery of arbitrary sized messages. We may wish in
future to consider adding extra functions for creating more special-case
connections, e.g. unordered, unreliable or limitations on message size.
e.g.

newConnUnreliable :: IO (SendEnd, ReceiveEnd)


Use in cloud Haskell implementation, part #1
============================================

Suppose we have a network transport that provides the above interface,
how would we rebuild a cloud Haskell implementation on top of that?

We will make the assumption that the communication end points provided
by our network transport are sufficiently lightweight that we can use
one per cloud Haskell process (for each processes' input message queue)
and one per cloud Haskell typed channel. The transport backend can
decide if it wants to do multiplexing or whatever.

Requirements for NodeID and ProcessId
-------------------------------------

A ProcessId serves two purposes, one is to communicate with a process
directly and the other is to talk about that process to various service
processes.

The main APIs involving a ProcessId are:

getSelfPid :: ProcessM ProcessId
send :: Serializable a => ProcessId -> a -> ProcessM ()
spawn :: NodeId -> Closure (ProcessM ()) -> ProcessM ProcessId

and linking and service requests:
linkProcess :: ProcessId -> ProcessM ()
monitorProcess :: ProcessId -> ProcessId -> MonitorAction -> ProcessM ()
nameQuery :: NodeId -> String -> ProcessM (Maybe ProcessId)

A NodeId is used to enable us to talk to the service processes on a
node.

The main APIs involving a NodeId are:

getSelfNode :: ProcessM NodeId
spawn :: NodeId -> Closure (ProcessM ()) -> ProcessM ProcessId
nameQuery :: NodeId -> String -> ProcessM (Maybe ProcessId)


NodeID and ProcessId representation
-----------------------------------

So for a ProcessId we need:
* a communication endpoint to the process message queue so we can
talk to the process itself
* the node id and an identifier for the process on that node so
that we can talk to node services about that process

data ProcessId = ProcessId SendEnd NodeId LocalProcessId

For a NodeId we need to be able to talk to the service processes on that
node.

data NodeId = NodeId SendEnd SendEnd ...

The multiple 'SendEnd's are for talking to the basic service processes
(ie the processes involved in implementing spawn and link/monitor)

cloud Haskell channel representation
------------------------------------

A cloud Haskell channel SendPort is similar to a ProcessId except that
we do not need the NodeId because we do not need to talk about the
process on the other end of the port.

data SendPort a = SendPort SendEnd
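
With that representation, sending on a typed channel is just serialisation
followed by the transport send. A minimal sketch (assuming the Serializable
class provides an encode :: a -> ByteString and that ProcessM can lift IO
actions; neither is fixed by this note):

-- illustrative sketch only, ignoring buffering and error handling
sendChan :: Serializable a => SendPort a -> a -> ProcessM ()
sendChan (SendPort sendEnd) x = liftIO (send sendEnd (encode x))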


Selecting a cloud Haskell transport backend
===========================================

While in principle it might be possible to pick just one transport layer
to rule them all (e.g. some C lib that abstracts over IP and various HPC
network systems) it seems likely in practice that we will need at least
two backends, one for IP using the existing ordinary Haskell networking
libraries, and another for HPC.

So we have to consider how we would select a backend. Suppose we have
the package 'remote' and a couple transport backends providing the
interface described above (newConn/send/receive): 'remote-ip',
'remote-hpc'.

There are two basic approaches we have considered. One is to select the
backend statically. That is, we would have package 'remote' depend on
either remote-ip or remote-hpc, controlled by a configure-time flag. So
the 'remote' package would be recompiled to select a different backend.

  +---------------+
  |  application  |
  +---------------+       <-- (or insert a higher level lib, TaskM, Par)
  | cloud haskell |
+-+---------------+-+     <-- interface using newConn/send/receive
|   cloud haskell   |
| transport backend |     <-- still in Haskell
+-+---------------+-+
  | C network lib |
  +---------------+
  | OS / hardware |
  +---------------+

Another approach is to select the backend dynamically at runtime. The
application initialises a backend by calling the backend package
directly; this backend is then used to run a ProcessM () action. Apart
from initialisation (perhaps including peer discovery), all other code
uses just the 'remote' package interface and not the backend package.

This kind of dependency arrangement is fairly common, e.g. the Haskell
database libraries do it (e.g. HDBC). In this model 'remote-ip' and
'remote-hpc' would actually depend on 'remote' and not the other way
around.

+---------------+
|  application  |
+---------------+
   |   \--------\
   |   +-------------------+
   |   |   cloud haskell   |
   |   | transport backend |
   |   +-------------------+
   |   /--------/    |
+---------------+    |
| cloud haskell |  +---------------+
+---------------+  | C network lib |
                   +---------------+
                   | OS / hardware |
                   +---------------+

For example, supposing in our application's 'main' we call the IP
backend and initialise a Transport object, representing the transport
backend for cloud Haskell:

initIpTransport :: Maybe FilePath -> IO Transport

Then when we run a process we pass in the transport backend:

runProcess :: Transport -> ProcessM () -> IO ()

Note that this approach allows different backends to take
backend-specific configuration information (which is again similar to
the design used by Haskell database libs).


Network and neighbour setup
===========================

One of the slightly tricky issues with writing a program for a cluster
is how to initialise everything in the first place: how to get each node
talking to the appropriate neighbours.

With the interfaces mentioned thus far we cannot actually communicate
with anyone as there is no way to obtain a NodeId of a neighbour.

The current cloud Haskell interface provides:

type PeerInfo = Map String [NodeId]
getPeers :: ProcessM PeerInfo
findPeerByRole :: PeerInfo -> String -> [NodeId]

The current implementation obtains this information using magic and
configuration files.

We discussed various designs. One involved the network transport
interface providing:

getAllPeers :: [SendEnd]

Another design was to provide

connect :: String -> IO SendEnd

where the string is some network transport-specific way of addressing a
node (much like database connection strings). This string would have to
be obtained from some initial configuration (e.g. config files and/or
input from a job scheduler).

The idea here is that all the network transport backends provide this
same interface. Though of course the interpretation of the connection
string is backend-specific.
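
For example (these address strings are invented purely for illustration):

-- hypothetical connection strings; only the backend that issued the
-- string needs to be able to interpret it
--   connect "tcp://node17.cluster.example:8442"   -- IP backend
--   connect "cci://0x1234:7"                      -- CCI backend
--   connect "mem://worker-3"                      -- in-process test backend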


Duncan's alternative suggestion
-------------------------------

My intuition, not necessarily shared by others, is that finding
appropriate initial neighbours is quite specific to the backend and
indeed specific to the kind of application one is writing. Peter Braam
pointed out that finding all neighbours in a 10,000 node cluster is not
likely to be a useful thing to do, and that a typical application in
such a setup is likely to only connect to a few neighbours (perhaps
determined by a cluster job scheduler).

So one option is that we don't add neighbour discovery into the generic
transport interface, and indeed don't add it generically in the cloud
Haskell library. Rather we push initial neighbour discovery into the
cloud Haskell backends. This relies on (or at least works more naturally
with) the dynamic backend model described above.

This is again similar to the database model where the generic database
interface provides no way to get an initial connection to a database.
The backends provide the methods to make the initial connection(s),
specifying details like addresses, authentication and various
db-specific settings.

For example, here is a hypothetical Main module using an IP backend:

import Control.Distributed.Process
import Control.Distributed.Process.Backend.IP (initIpTransport)
import System.ClusterJobScheduler (getJobConfig, getMasterNode)
-- note: this job scheduler is specific to the IP backend,
-- other designs are possible

main = do
  config <- getJobConfig
  trans  <- initIpTransport config
  master <- getMasterNode trans
  runProcess trans (go master)

-- we're master:
go Nothing = do
  -- expect to receive messages from our workers

-- we've been given a master to contact:
go (Just master) = do
  -- contact our master node to let them know we exist


This approach would mean that we don't need any generic interface for
finding peers. The backend can just provide whatever is appropriate and
it can construct the appropriate connections to make one or more NodeIds
that can then be used by the application for bootstrapping the
application-level communication network.

This also pushes issues like dynamic vs static discovery or addition of
nodes into the backend where it can probably be addressed more directly
than providing some generic API. It doesn't preclude a generic api, but
it does allow backends to provide extra or more specific discovery or
configuration mechanisms.


Network transport interface, part #2
====================================

Suppose we take the dynamic backend approach. What would the Transport
object look like?

It could (almost) be as simple as:

data Transport = Transport {
    newConn :: IO (SendAddr, ReceiveEnd)
  }

which of course begs the question of what the SendAddr, SendEnd and
ReceiveEnd look like. These could (almost) be as simple as:

newtype SendAddr = SendAddr { connect :: IO SendEnd }
newtype SendEnd = SendEnd { send :: ByteString -> IO () }
newtype ReceiveEnd = ReceiveEnd { receive :: IO ByteString }

were it not for the fact that of course 'SendAddr's need to be
serialisable.

For serialisation we could add it to the SendAddr:

data SendAddr = SendAddr {
    connect           :: IO SendEnd,
    serialiseSendAddr :: ByteString
  }

while for deserialisation we could add it to the Transport:

data Transport = Transport {
    newConn             :: IO (SendAddr, ReceiveEnd),
    deserialiseSendAddr :: ByteString -> SendAddr
  }

I have a prototype of this Transport interface, with a trivial in-memory
implementation using MVars. In addition I have an early prototype of a
simple Process layer built on top of the Transport layer.
The purpose of this prototype is to check that the design is feasible. I
intend to add serialisation/deserialisation of the SendAddr and
ProcessID to the prototype and then share the code for further
discussion.
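
For a flavour of what that looks like, here is an illustrative sketch (not the
actual prototype) of an in-memory transport in terms of the records above: each
connection is just a Chan shared by both ends, 'connect' is free because the
SendAddr already holds the live end, and serialisation is left undefined since
it makes no sense in-memory:

import Control.Concurrent.Chan (newChan, readChan, writeChan)

inMemoryTransport :: Transport
inMemoryTransport = Transport
  { newConn = do
      chan <- newChan
      let sendEnd = SendEnd { send = writeChan chan }
          addr    = SendAddr
            { connect           = return sendEnd
            , serialiseSendAddr = error "in-memory addresses are not serialisable"
            }
      return (addr, ReceiveEnd { receive = readChan chan })
  , deserialiseSendAddr = error "in-memory addresses are not serialisable"
  }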

--
Duncan Coutts, Haskell Consultant
Well-Typed LLP, http://www.well-typed.com/


Simon Peyton-Jones

Nov 2, 2011, 6:30:37 PM
to dun...@well-typed.com, parallel...@googlegroups.com
| Thus a critical design decision is whether deserialisation of a send
| endpoint is pure or if it is in IO. If it is in IO then a SendEnd can be
| a stateful object (e.g. could hold an IP socket). On the other hand if
| deserialisation is pure then a SendEnd can only be a piece of data and
| it would be 'send' that would have to do all the work of initialising
| any stateful network connections.
...
| Thus my suggestion is that we have:
|
| newConn :: IO (SendAddr, ReceiveEnd)
| connect :: SendAddr -> IO SendEnd

Wait! There need be no relationship between
* the type of newConn
* whether SendEnd serialisation/deserialisation is pure

An alternative design would have
newConn :: IO (SendEnd, ReceiveEnd)
but make serialiation and deserialisation impure.

However closure construction must be pure. So this alternative design would be:

data Closure a where
  MkClo :: IO ByteString -> (ByteString -> IO a) -> Closure a

and the serialisation class would look like:

class Serialise a where
  serialise   :: a -> IO ByteString
  deserialise :: ByteString -> IO a

Now you don't need to make the user think about connect; the act of deserialising a SendEnd would do the connection setup.
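
A sketch of what that could look like, with addressOf and connectTo standing in
for whatever the backend provides (both names are invented here):

-- sketch only: serialising a live SendEnd just extracts its wire address;
-- deserialising is where the connection gets established
addressOf :: SendEnd -> ByteString
addressOf = undefined

connectTo :: ByteString -> IO SendEnd
connectTo = undefined

instance Serialise SendEnd where
  serialise   end   = return (addressOf end)   -- cheap
  deserialise bytes = connectTo bytes          -- pays the connection cost here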

I'm unsure whether all this is worth it. In your "connect-then-send" model, it's likely that people will just wrap them together so that a connect precedes every send. Presumably connect consults the connection cache. So now you are back to square 1. Is it really such a big deal to consult the cache before every send?

Anyway the interface I describe would let the library author decide whether to establish a connection at deserialisation, or at send. It would not allow doing so at some *other* time, which yours would. But in exchange it's simpler; and I would be unsurprised if having serialisation in IO proved useful in other ways.

Simon

Ryan Newton

Nov 3, 2011, 1:43:58 PM
to dun...@well-typed.com, parallel...@googlegroups.com
Thanks, Duncan for writing all this up!

I think it's critical to have this kind of lower level library for
implementing higher-level systems built on top of distributed Haskell.
 We have already run into the problem that CloudHaskell is too
high-level and too much of an end-user package to always be the best
for building other systems.  For example, its message receiving
mechanism has the whole "matching" functionality built into it, and
even the simple 'expect' is defined in terms of the matching receive:

  http://hackage.haskell.org/packages/archive/remote/0.1/doc/html/src/Remote-Process.html#expect

It sounds like expect could turn into just a "receive" in Duncan's
outlined API.  By the way, it may help to make clear the
synchronous/asynchronous properties in the CloudHaskell documentation.
 For example "send" right now doesn't say:

  http://hackage.haskell.org/packages/archive/remote/0.1/doc/html/Remote-Process.html#g:3

So hurrah for the lower level, but portable, messaging library.
However, while a low-level library is welcome, it may be good to leave
some room for innovation under, as well as on top of, this API.
Probably this means putting extra function variants in the API, even
if they aren't implemented well at first.  In addition to
"newConnUnreliable" that Duncan mentioned, maybe it would be
interesting to have a version that works on lazy bytestrings.
  Related aside: Another guy at my institution (Martin Swany) works on
network performance analysis and optimizing use of the network by
program transformation.  They've got a neat system that, using program
analysis, automatically refactors MPI codes to autotune message
size and send smaller messages earlier (rather than waiting until the
whole blob is ready).  In Haskell we could at least open up the
*possibility* of playing with that sort of optimization just by
having a lazy-bytestring or Iteratee-based interface for sending large
messages.
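
Purely as an illustration of the kind of variant I mean (a sketch against
Duncan's proposed send, not a worked-out design), such a send could be a thin
wrapper that pushes chunks through the strict interface as they become
available:

import qualified Data.ByteString.Lazy as BL

-- sketch: send each chunk as soon as it is forced, trading one big send
-- for several smaller, earlier ones; whether these count as one logical
-- message or many would need pinning down
sendLazy :: SendEnd -> BL.ByteString -> IO ()
sendLazy end lbs = mapM_ (send end) (BL.toChunks lbs)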

> getAllPeers :: [SendEnd]

We definitely want the possibility of a dynamic set of peers (unlike
MPI), right?  So you wouldn't want exactly this type...

I like "Duncan's alternative suggestion" where the peer configuration
may be in its own package (System.ClusterJobScheduler -- separate,
even, from CloudHaskell backends). It probably is complex enough that
it deserves a careful design and its own package(s). I also like that
*easy* option you mentioned of SSH based connections.  Good to have at
least one easy-entry option where people can get going with the system
in <30 min.

Cheers,
-Ryan

Peter Braam

Nov 3, 2011, 2:56:04 PM
to dun...@well-typed.com, parallel...@googlegroups.com
Hi Duncan -

You beat me to writing it up!   I have quite a few comments.  I first inserted them into your message, but given its length I decided it was better to keep them together.


TRANSPORTS

I would like to add the following.  There are "transports" (as Duncan calls them) that focus purely on transporting data.  Examples are:

TCP/IP, UDP
IB/Verbs (or one of their other 10 protocols or so)
CCI
Portals
Pipes
Numa transports
PCI transports

Portals and CCI are unique in attempting to provide 100% consistent semantics over multiple lower protocols.  Portals is much fatter than CCI.

Then there are transports embedded in generally much larger libraries, and layered on one or more of the above. Examples are:

MPI
ZeroMQ
ssh
SSL sockets
http connections
RPC connections (many flavors)

Generally these 2nd group of transports have the following attributes:

- there is enormous overhead (e.g. MPI easily has 2x latency overhead over CCI or IB; IP is so slow that MPI/IP probably has negligible overhead over IP)
- at extreme scale, many of the protocols above have problems (e.g.: Large web applications are usually enormous trees of Squid proxies.  MPI is being displaced by a proprietary protocol from ETI called SWARM to scale much better on the Graph500 benchmark)
- there are enormous libraries of functionality available (MPI is the best example - it has 100's of calls that are not directly transport related).  Unless re-writing those libraries from scratch is desirable, it's nice to have access to all of it.
- the failure semantics are very different from that of the underlying low level transport.  E.g. one failure takes an entire MPI process group down (normally all nodes running a job, refinements to this are coming but not yet stable), and most RPC protocols have "their own" timeouts

In my experience the different failure semantics make it extremely difficult to put a "fat" protocol in place of a thin one merely for the purpose of moving packets.  If you use a fat one, you want to be using the library of additional routines.

There are some exceptions, like IPSEC tunnels etc, but on the whole the security semantics make at least the connection API quite different, involving interaction with key / certificate management systems, and in many cases (GSS-API) every packet needs special purpose processing, which can be application dependent.  I think that is best seen as an application leveraging the primitives we introduce.


The basic thing we want is point-to-point send & receive of variable
size blobs of bytes. We will put off consideration of other services
like broadcast.


Unfortunately I think you can't put this off, but maybe the definitions you give here are good enough to capture it.  A scalable client server implementation will insist on many (clients) to one (server) communication and want a single request buffer.  The same holds in reverse for multicast to groups of hosts.

We need some notion of network endpoint and/or address.

ENDPOINTS / ADDRESSES

Yes, and I think we agreed that these are two very different concepts.   The endpoint as I understand it is a networking level object.  The address is a resource name.  Connections are networking level objects (that tie two endpoints).  See my discussion below on resources.

REVIEW OF CURRENT PROGRAMMING PRACTICES IN C

In order to help verifying the types under discussion I reviewed what is currently done in C applications.  The sequence of setting up communications is as follows:

1. get a list of networking devices (these are normally configured devices that have an address of some kind and a driver in the OS/libraries), select a device.
2. bind an endpoint to the device.  This ties one or more addresses to the endpoint. (Extra parameters are often required, e.g. buffer sizes)

server:
3. wait for connection events to come in to the endpoint
4. accept the connection and tie it to the same or another endpoint

client: 
3. translate a resource name to an internal address (see below)
4. send a connect message from the endpoint (2) to the address (the connect message often carries other parameters)
5. wait for a response that the endpoints are connected

What we see here is that:
(1) endpoint are bound to devices
(2) connect packets are sent from source endpoints to addresses
(3) connect packets are received on destination endpoints named by addresses

If the connection phase is successful, other packets can be sent on the connection.  Typically connections can be used bi-directionally and people generally hate that and like unidirectional connections.  Connections have a lot of internal attributes related to ordering (or not), reliability (or not), buffering and timeouts.  The sophistication of connection-associated data cannot be overestimated - read the TCP Reno papers as an example (I think it discussed things like acknowledgement vectors, slow-start, and a variety of complicated behavioral bits).  Most ad-hoc implementations of connections over UDP have run into usability issues in the WAN.

I want to mention that the terminology "send endpoint" and "receive endpoint" is not one we should adopt because they are ambiguous (many people think the "endpoint" is remote).  Better words are "source" (where packets depart) and "dest" (where packets arrive). (Even this naming is "polluted" because when source is used in conjunction with sink it has the opposite meaning.)  The confusion is best illustrated by Duncan's wording below (which I believe is the opposite of mine, by accident.)

The following patterns are common:

1. a small message is sent directly over the connection (sometimes delivered with an event, sometimes inside an event)

2. a small packet is used to communicate a DMA address in the memory of the remote side of the connection. This remote address can always be used to send data to (using a local DMA address) and by some API's (more efficient) to read from.  The C-api is:

dma_send(source_side_addr, dest_side_addr, buflen, buffer *)

3. a vectorized (so called iov) version of (2) where multiple source buffers are registered for DMA and transported to the destination.

Transports (2) and (3) go at bus speeds and involve no CPU on advanced networks.   The buffers we send with DMA are typically not serialized (and copying or packaging in any form is to be avoided).  Multiple DMA's may be active over an endpoint simultaneously (this is called multiplexing).  Can the functions above get extra parameters for the iov addresses?  Not having iov capability is a nono.

Secondly, in C api's typically all results are communicated through events.  Waiting for events is done in two ways (1) spinning and (2) sleeping.  In IP networking the event is found in return codes and modified parameters (bitmaps in select and poll for example). In advanced communication libraries event packages are placed in buffers by the networking hardware.   Sleeping is very very harmful to performance, but spinning is not used much outside of HPC to my knowledge.  The key harm caused by sleeping lies in the fact that upon arrival of the event the scheduler may run the waiting process on a different CPU blowing all the caches.    A key question is if the Haskell thread implementation can avoid this harm?  If it can't, the blocking nature of the send and receive functions discussed above will prevent us from multiplexing communications over one connection appropriately (a nono). 

RESOURCES

Finally I want to come back to the issue of resources.  Today it is probably still true that some kind of process running on a node is responsible for communications, although MPI has replaced that with a communicator which is a set of nodes that is communicating (and fails as a group).  This is about to come to an end, with communications to GPU's, FPGA's and switches becoming first class citizens and with resilience requiring possibly transparent migration of such entities.

Above I think I touched on the data constructors for endpoints.  The question here is whether endpoints are standalone or always associated with a Haskell execution environment.  I think we need to think carefully about whether that is compatible with the paradigms / examples we have begun to discuss.  If we want to use pure TCP sockets to write a web client, we will be sending messages to non-Haskell programs - I think we want send and receive (with a raw TCP transport) to be useful for that?

Secondly, the issue of strings as addresses isn't sitting well with me.  Different transports will accept different types of addresses, and it's merely because we are dealing with the C department that they are currently always packed in a string or perhaps an integer.   Most transports send these addresses through a naming service that translates them into some other format.

An application framework (MPI, Hadoop, Erlang-like, Cloud-Haskell) appears sometimes to have its own address space (MPI rank - the process number - being a good example).  I know that CCI is using URI based addressing, and discussing a naming service that can accommodate URIs for the application frameworks they want to support (e.g. job schedulers reporting node and process id).

Ryan brought up this issue also, I think that we should be careful to stuff the minimum into the transport API's and leave the remainder to a different piece of software that can communicate to us things that particular frameworks want.  I'm not quite sure what I'm saying here in terms of types.

I hope this is helpful - it is clearly only a part of the puzzle.

Regards,

Peter









 
These will of

course have to be abstract since the details will vary between network
backends. The endpoints are used for sending and receiving messages:

send    :: SendEnd -> ByteString -> IO ()
receive :: ReceiveEnd -> IO ByteString

For reasons we will detail shortly, we follow the cloud Haskell design
of separating the types of the send and receive endpoints.



Send endpoint/address serialisation
-----------------------------------

We want to be able to copy a send endpoint anywhere in the network and
use it to send messages to the receive endpoint. This requires that send
endpoints be serialisable. However, most network endpoints are
associated with some resources, and are not simply an address as a piece
of data.

After connection establishment, yes.
 

Thus a critical design decision is whether deserialisation of a send
endpoint is pure or if it is in IO. If it is in IO then a SendEnd can be
a stateful object (e.g. could hold an IP socket). On the other hand if
deserialisation is pure then a SendEnd can only be a piece of data and
it would be 'send' that would have to do all the work of initialising
any stateful network connections.

Don't the send endpoints communicate success or failure (even for connects)?  I would assume that the events are tied to the endpoint?  Given that someone could cut the cable, I think it has state.
Wow - this typedef shows how extremely confusing the naming with Send and Receive is - didn't I just write the opposite above?
Don't many protocols have an MTU?
 
     * messages delivered in-order
               For messages sent on the same SendEnd (but not
               necessarily SendAddr).

Ordered delivery is only available for TCP and becomes a huge headache with routing and channel bonding (different fragments of messages follow different paths) and re-assembly is complicated.

 
     * reliable
               to the degree that messages are delivered at most once
               and subsequent messages are not delivered until earlier
               ones are

This mixes ordering and reliability.
 
     * somewhat asynchronous send permitted:
             * send is not synchronous, send completing does not imply
               successful delivery
             * send side buffering is permitted but not required
             * receive side buffering is permitted but not required
             * send may block (e.g. if too much data is in flight or
               destination buffers are full)

Many protocols need to deliver an error; blocking is not an option (if, for example, all memory has been used).
 
     * mismatched send/receive permitted
               It is not an error to send without a thread at the other
               end already waiting in receive (but it may block).

These properties are based on what we can get with (or build on top of)
tcp/ip, udp/ip, unix IPC, MPI and the CCI HPC transport. In particular
CCI emphasises the property that a node should be able to operate with
receive buffer size that is independent of the number of
connections/nodes it communicates with (unlike tcp/ip which has a buffer
per connection). Also, CCI allows unexpected receipt of small messages
but requires pre-arrangement for large transfers (so the receive side
can prepare buffers).

This is a pretty general purpose connection type, supporting as it does
reliable ordered delivery of arbitrary sized messages. We may wish in
future to consider adding extra functions for creating more special-case
connections, e.g. unordered, unreliable or limitations on message size.
e.g.

I think CCI offers these options, but ordering is not desirable and sending small messages larger than the MTU is not permitted, and buffer overflow is a (silent) error.

Duncan Coutts

Nov 3, 2011, 5:51:12 PM
to Simon Peyton-Jones, parallel...@googlegroups.com
On Wed, 2011-11-02 at 22:30 +0000, Simon Peyton-Jones wrote:
> | Thus a critical design decision is whether deserialisation of a send
> | endpoint is pure or if it is in IO. If it is in IO then a SendEnd can be
> | a stateful object (e.g. could hold an IP socket). On the other hand if
> | deserialisation is pure then a SendEnd can only be a piece of data and
> | it would be 'send' that would have to do all the work of initialising
> | any stateful network connections.
> ...
> | Thus my suggestion is that we have:
> |
> | newConn :: IO (SendAddr, ReceiveEnd)
> | connect :: SendAddr -> IO SendEnd
>
> Wait! There need be no relationship between
> * the type of newConn
> * whether SendEnd serialisation/deserialisation is pure
>
> An alternative design would have
> newConn :: IO (SendEnd, ReceiveEnd)
> but make serialiation and deserialisation impure.

I considered this. It's effectively the same thing. We would have:

newConn :: IO (SendEnd, ReceiveEnd)

serialiseSendEnd :: SendEnd -> ByteString
deserialiseSendEnd :: ByteString -> IO SendEnd

If you read ByteString as SendAddr and deserialiseSendEnd as connect
then these are essentially the same. Though strictly speaking you'd also
have:

newConn :: IO (ByteString, ReceiveEnd)

So why call it connect rather than deserialise, and why have newConn
produce the address as data rather than giving a SendEnd directly?

It's because I think it more accurately reflects the costs of the
operations. The implementations I'm thinking of allowing involve
SendEnd's being handles on IO resources (e.g. a Haskell Handle or
Socket). So it's then not really true to call it just deserialisation,
we're allocating an expensive resource and potentially setting up a
network connection. We're making more explicit the cost in establishing
the connection. Otherwise that is hidden in the first time you use
'send'.

Given that we are front loading the cost, it also makes sense not to do
that immediately when we call newConn, since it's quite likely that
we'll only ever pass the address off to remote nodes and never use it
locally. If we want to use it locally, we can do so explicitly.


> However closure construction must be pure.

Yes. But that's all at the cloud Haskell layer. It doesn't mean it has
to be done the same way at the transport layer, so long as what we do at
the transport layer doesn't make what we want at the cloud Haskell layer
impossible, which I don't think it does. We can have impure
deserialisation/connect for the transport endpoints but still have pure
deserialisation of cloud Haskell ProcessIds and SendPorts. It just means
that cloud Haskell has to pay the cost of keeping a cache mapping
transport send addresses to send endpoints, which is exactly what it
does now (ip addresses to open sockets).

> So this alternative design would be:
>
> data Closure a where
> MkClo :: IO ByteString -> (ByteString -> IO a) -> Closure a
>
> and the serialisation class would look like:
>
> class Serialise a where
> serialise :: a -> IO ByteString
> deserialise :: ByteString -> IO a
>
> Now you don't need to make the user think about connect; the act of
> deserialising a SendEnd would do the connection setup.

If we're talking about the cloud Haskell user, indeed no they will never
have to think about connect.

And yes it would be interesting to see if we could reduce the overheads
(eliminating the endpoint cache) by making the deserialisation impure as
you suggest. It's something I was hinting at at the end of my long
email, but I didn't put any details down. For the moment I was presuming
we'd keep the current pure serialisation + cache approach.

> I'm unsure whether all this is worth it. In you "connect-then-send"
> model, it's likely that people will just wrap them together so that a
> connect precedes every send. Presumably connect consults the
> connection cache. So now you are back to square 1. Is it really
> such a big deal to consult the cache before every send?

No it's not such a big deal. It's what cloud Haskell does right now, and
would continue to do under my suggestion. It just means that the costs
are explicit to the user of the transport layer (ie cloud haskell and
other middleware) and not all users of the transport layer necessarily
have to pay the cost of a connection cache.

> Anyway the interface I describe would let the library author decide
> whether to establish a connection at deserialisation, or at send. It
> would not allow doing so at some *other* time, which yours would. But
> in exchange it's simpler; and I would be unsurprised if having
> serialisation in IO proved useful in other ways.

So I think in conclusion we're actually talking about the same thing and
we agree. :-)

Dmitry Astapov

Nov 3, 2011, 6:07:53 PM
to rrne...@gmail.com, dun...@well-typed.com, parallel...@googlegroups.com
On Thu, Nov 3, 2011 at 5:43 PM, Ryan Newton <rrne...@gmail.com> wrote:

> getAllPeers :: [SendEnd]

We definitely want the possibility of a dynamic set of peers (unlike
MPI), right?  So you wouldn't want exactly this type...

It may sound like nitpicking, but MPI supports dynamic sets of peers (MPI_Comm_spawn).
 

--
Dmitry Astapov

Peter Braam

Nov 3, 2011, 6:40:41 PM
to dun...@well-typed.com, Simon Peyton-Jones, parallel...@googlegroups.com
Hi again -

Connect and/or endpoint creation have other parameters (e.g. buffered vs eager, ordered, reliable).  How are you passing these?  

I fear that users should be able to specify these choices, at least if they wish.

Peter

Duncan Coutts

Nov 3, 2011, 7:09:19 PM
to rrne...@gmail.com, parallel...@googlegroups.com
On Thu, 2011-11-03 at 13:43 -0400, Ryan Newton wrote:
> Thanks, Duncan for writing all this up!
>
> I think it's critical to have this kind of lower level library for
> implementing higher-level systems built on top of distributed Haskell.
> We have already run into the problem that CloudHaskell is too
> high-level and too much of an end-user package to always be the best
> for building other systems.

Right, since it's modelling the erlang actor approach.

> For example, its message receiving
> mechanism has the whole "matching" functionality built into it, and
> even the simple 'expect' is defined in terms of the matching receive:
>
> http://hackage.haskell.org/packages/archive/remote/0.1/doc/html/src/Remote-Process.html#expect

> It sounds like expect could turn into just a "receive" in Duncan's
> outlined API.

No, 'expect' returns a message of the type you're expecting, not the
next message to arrive (which could be of any type). On the other hand,
waiting on a channel (without any channel merging) could be just a
'receive'.

> So hurrah for the lower level, but portable, messaging library.
> However, while a low-level library is welcome, it may be good to leave
> some room for innovation under, as well as on top of, this API.
> Probably this means putting extra function variants in the API, even
> if they aren't implemented well at first. In addition to
> "newConnUnreliable" that Duncan mentioned, maybe it would be
> interesting to have a version that works on lazy bytestrings.

So I'm not quite sure if you mean a vectored send where you have a
single logical message defined as the concatenation of a bunch of
buffers, or if you mean something like sending a sequence of messages.
Or maybe you mean sending a single logical message but not necessarily
requiring the whole sequence of chunks to be present in memory at once
(as in traditional vectored send) but building up a single logical
message piece by piece. Would that have any network performance
advantages over doing things at the message level and sending a stream
using multiple logical messages?

Certainly it would be useful to provide the traditional vectored send.
For large messages generated by binary serialisation it's cheaper to
produce a sequence of chunks than to produce one massive chunk.

> Related aside: Another guy at my institution (Martin Swany) works on
> network performance analysis and optimizing use of the network by
> program transformation. They've got a neat system that, using program
> analysis, automatically refactors MPI-codes to do autotune message
> size and send smaller messages earlier (rather than waiting until the
> whole blob is ready). In Haskell we could at least open up the
> *possibilitiy* of playing with that sort of optimization just by
> having a lazy-bytestring or Iteratee-based interface for sending large
> messages.

Their system sounds like they're packing multiple logical messages into
fewer network sends, in which case putting smaller messages first would
get them sent out earlier. I'm not sure if/how that applies in our
context.

> > getAllPeers :: [SendEnd]
>
> We definitely want the possibility of a dynamic set of peers (unlike
> MPI), right? So you wouldn't want exactly this type...

Oh I think I left off the IO in that, I think the suggestion was

getAllPeers :: IO [SendEnd]

so it might give you different results if you call it later. Of course
this is still horribly limited since there's no notification, and it's
more than you want since it tells you all peers, rather than specific
ones.

> I like "Duncan's alternative suggestion" where the peer configuration
> may be in its own package (System.ClusterJobScheduler -- separate,
> even, from CloudHaskell backends). It probably is complex enough that
> it deserves a careful design and its own package(s).

Good, I'm glad it's not just me :-)

> I also like that *easy* option you mentioned of SSH based connections.
> Good to have at least one easy-entry option where people can get going
> with the system in <30 min.

It's pie in the sky of course, but yes I thought it might provide a
convenient low performance option. I know that's how people used to use
the informal cluster of desktop machines at my university. It's ok for
embarrassingly parallel jobs where there's a single master.

Duncan Coutts

Nov 3, 2011, 7:38:26 PM
to Peter Braam, parallel...@googlegroups.com
On Thu, 2011-11-03 at 16:40 -0600, Peter Braam wrote:
> Hi again -
>
> Connect and/or endpoint creation have other parameters (e.g. buffered vs
> eager, ordered, reliable). How are you passing these?
>
> I fear that users should be able to specify these choices, at least if they
> wish.

I mentioned that we could consider providing

newConnUnreliable :: IO (SendEnd, ReceiveEnd)

From my point of view, reliable and unreliable connections are really
different kinds of things (possibly but not necessarily different
types). The same applies to ordered vs unordered.

BTW, am I right in thinking that the only combinations that make sense
are:
* ordered & reliable
* unordered & reliable
* unordered & unreliable

That is, ordered & unreliable doesn't really exist, or isn't useful?

So with that point of view, it's really newConn* and not connect that
should determine the kind of connection. It probably does not make sense
for one end of a connection to think it's reliable while the other end
thinks it unreliable.

On the other hand, whether the sending end wants to buffer or send
messages eagerly is a local decision. So that could be set at connect
time.

Note that it's not necessarily essential that we provide these various
options on the newConn and connect functions. That's only necessary if
applications or middleware (like cloud haskell) want to specify these
options differently for different connections in the same app context.
Otherwise it'd be possible to set these options globally for the whole
transport at the point at which the transport is initialised.

If we do provide them, we'll have to be clear about if and how they
affect the meaning. For example I expect that the buffered vs eager
can't actually change the semantics and that it's just a hint which the
transport implementation may or may not heed.
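
For concreteness, one hypothetical shape for this (all names invented here)
would be a hints record passed at connection creation, with the current newConn
as the default:

data Reliability = ReliableOrdered | ReliableUnordered | Unreliable
data Buffering   = Buffered | Eager    -- a local hint only, no semantic effect

data ConnHints = ConnHints
  { hintReliability :: Reliability
  , hintBuffering   :: Buffering
  }

-- newConn would then be equivalent to
--   newConnWith (ConnHints ReliableOrdered Buffered)
newConnWith :: ConnHints -> IO (SendAddr, ReceiveEnd)
newConnWith = undefined   -- backend-specific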

Ryan Newton

Nov 4, 2011, 12:43:32 AM
to Duncan Coutts, parallel...@googlegroups.com
>> It sounds like expect could turn into just a "receive" in Duncan's
>> outlined API.
>
> No, 'expect' returns a message of the type you're expecting, not the
> next message to arrive (which could be of any type). On the other hand,
> waiting on a channel (without any channel merging) could be just a
> 'receive'.

Eek. That was a mental error on my part -- with CloudHaskell paged
out of my brain I forgot the typed channel vs. process expect/send
distinction and made the above erroneous comment.

> Or maybe you mean sending a single logical message but not necessarily
> requiring the whole sequence of chunks to be present in memory at once
> (as in traditional vectored send) but building up a single logical
> message piece by piece. Would that have any network performance
> advantages over doing things at the message level and sending a stream
> using multiple logical messages?

Yep, if I send a single 500MB lazy Bytestring as logical send, it is
probably better to send the smaller pieces as the chunks become
available.

> Their system sounds like they're packing multiple logical messages into
> fewer network sends,

Actually, both that and the *opposite*. It's intuitive that batching
small messages can sometimes help. But they have also found that MPI
programmers often do this to the point of overkill. Therefore their
tool can get performance improvements by moving MPI calls around
and out of for-loops to transform a program that sends a big message
into one that sends smaller messages [starting earlier].

I thought this might be a place where laziness would allow us to at
least potentially autotune the amount of your lazy bytestring that
gets sent per network message. (You can split a strict bytestring up
into multiple messages if you like, but you can't overlap sends with
the computation that produces the bytestring.)

Cheers,
-Ryan

Ryan Newton

Nov 4, 2011, 12:45:43 AM
to dun...@well-typed.com, Peter Braam, parallel...@googlegroups.com
> BTW, am I right in thinking that the only combinations that make sense
> are:
>      * ordered & reliable
>      * unordered & reliable
>      * unordered & unreliable

Probably overkill, but if this list does grow longer yielding more
total combinations I suppose we could represent it through phantom
type parameters and type families...
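
Something along these lines, purely as a sketch of the phantom-parameter idea
(the empty data declarations are just type-level tags, and the names are made
up here):

data Reliable
data Unreliable
data Ordered
data Unordered

newtype SendEnd    ord rel = SendEnd    (ByteString -> IO ())
newtype ReceiveEnd ord rel = ReceiveEnd (IO ByteString)

-- only the combinations that make sense get a constructor function:
newConn           :: IO (SendEnd Ordered   Reliable,   ReceiveEnd Ordered   Reliable)
newConnUnordered  :: IO (SendEnd Unordered Reliable,   ReceiveEnd Unordered Reliable)
newConnUnreliable :: IO (SendEnd Unordered Unreliable, ReceiveEnd Unordered Unreliable)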

Simon Peyton-Jones

Nov 4, 2011, 9:23:27 AM
to Duncan Coutts, parallel...@googlegroups.com
| you suggest. It's something I was hinting at at the end of my long
| email, but I didn't put any details down. For the moment I was presuming
| we'd keep the current pure serialisation + cache approach.
..
| So I think in conclusion we're actually talking about the same thing and
| we agree. :-)

Well maybe not. My suggestion is: explore a design in which
* There is no distinction between SendAddr and SendEnd
* Serialisation/deserialisation is impure

Your suggestion is "presume we keep pure serialisation". But if the approach worked we could simplify the API by abolishing the proposed distinction between SendEnd and SendAddr.

Simon

Duncan Coutts

Nov 4, 2011, 11:19:35 AM
to Simon Peyton-Jones, parallel...@googlegroups.com

Ok, but we are at least talking about things that are more or less
equivalent. The issue is why I've stuck an intermediate type in between
SendEnd and the serialised representation as a ByteString.

newConn :: IO (SendEnd, ReceiveEnd)
serialise :: SendEnd -> ByteString
deserialise :: ByteString -> IO SendEnd --equiv of connect below

vs

newConn :: IO (SendAddr, ReceiveEnd)
connect :: SendAddr -> IO SendEnd --equiv of deserialise above
serialise :: SendAddr -> ByteString
deserialise :: ByteString -> SendAddr

So what are the expected costs of these functions and IO actions?

The costs that need to be allocated are (roughly):
  A. parsing/validation of network address (cheap)
  B. making connection endpoint resource (e.g. calling C
     socket()/bind())
  C. establishing connection by sending out packets (e.g. calling C
     connect())

Costs B and C can be distributed between:
* deserialise / connect call
* first 'send' on a freshly made SendEnd

We already talked about a model with pure deserialisation where all the
cost had to be put in the initial 'send' call (and it incurred a cost
for each subsequent 'send' to look up the underlying real connection.)

With impure deserialisation we can choose which of B and C we want to
include, neither, just B or both B and C. All costs not paid at this
stage are paid by the first 'send'. We could avoid paying B and C by
allocating an MVar/IORef initialised to some null value, and only
filling in the real object when the first 'send' is done.
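
Roughly, as a sketch of that option only (and ignoring the race between two
threads doing the first send at once, which a real version would handle with an
MVar):

import Data.IORef (IORef, readIORef, writeIORef)

-- a send end that defers costs B and C until the first send
data LazySendEnd = LazySendEnd SendAddr (IORef (Maybe SendEnd))

lazySend :: LazySendEnd -> ByteString -> IO ()
lazySend (LazySendEnd addr ref) msg = do
  mEnd <- readIORef ref
  end  <- case mEnd of
            Just e  -> return e
            Nothing -> do e <- connect addr        -- pay B and C now
                          writeIORef ref (Just e)
                          return e
  send end msg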

Of all the costs, C is the greatest. I think it makes sense to be
explicit about when that cost is paid. I think it's also fairly likely
that we might not want to pay that cost as soon as we receive a
serialised address from a peer, but rather nearer to the time when we
are ready to start communicating.

Additionally, I think it makes sense to be able to pay the connection
costs separately from the initial message send. Establishing connections
is typically synchronous and hence more expensive than sending ordinary
messages. There is also an error handling angle. We find out at
connection time, before sending application layer messages, if the peer
exists at all. So there's a potential advantage to being able to do that
without having to use send.

Of course, with the first model apps/middleware can still do that by just
keeping the serialised ByteString around rather than converting it to a
SendEnd, but if it's going to hang around for a while then it's probably
preferable to give it its own type.

Secondly, if cost C is paid by deserialise/connect rather than the first
'send' then it also makes sense for newConn to give us the
not-yet-connected form because it's highly likely that a node is not
going to communicate with itself, but is going to send the address to a
remote node. The same applies to cost B in fact.

So the thing that the distinct SendAddr provides is that we can account
the expensive B & C costs to 'connect' and we can pay them any time in
between deserialisation and send time.

Peter Braam

Nov 4, 2011, 3:47:35 PM
to Duncan Coutts, parallel...@googlegroups.com
Hi Duncan -

On Thu, Nov 3, 2011 at 5:38 PM, Duncan Coutts <dun...@well-typed.com> wrote:
On Thu, 2011-11-03 at 16:40 -0600, Peter Braam wrote:
> Hi again -
>
> Connect and/or endpoint creation have other parameters (e.g. buffered vs
> eager, ordered, reliable).  How are you passing these?
>
> I fear that users should be able to specify these choices, at least if they
> wish.

I mentioned that we could consider providing

newConnUnreliable :: IO (SendEnd, ReceiveEnd)

From my point of view, reliable and unreliable connections are really
different kinds of things (possibly but not necessarily different
types). The same applies to ordered vs unordered.

BTW, am I right in thinking that the only combinations that make sense
are:
     * ordered & reliable
     * unordered & reliable
     * unordered & unreliable


That's right.  Staying with the IP socket case, your newConn* commands also include calls to bind(2), listen(2), accept(2) and setsockopt(2).  These possibly have numerous other parameters.  What we should avoid at all costs is that such parameters can only be set through configuration files - we need a programmatic interface for that.

A great but simplified overview of this is on an IBM page (which repeats my explanation, in their case for sockets) 

For high performance connections (probably including ethernet cards with offload facilities) there is a legion of parameters, and while they may seem exotic, they are definitely not unusual if you need to max out performance.  For example, large scale services reserve considerable amounts of memory to avoid running out of buffers (there is often no spare memory, so allocating it dynamically is not always an option).  All of those things become parameters. Uugh, I know.

In other words, I fear we want, as dirty as it may look, more parameters.
 
That is, ordered & unreliable doesn't really exist, or isn't useful?

Correct.
 

So with that point of view, it's really newConn* and not connect that
should determine the kind of connection. It probably does not make sense
for one end of a connection to think it's reliable while the other end
thinks it unreliable.

Actually some parameters, like the MTU, are negotiated at connection time by the lower layers, but in some APIs they are passed in.
 

On the other hand, whether the sending end wants to buffer or send
messages eagerly is a local decision. So that could be set at connect
time.

Note that it's not necessarily essential that we provide these various
options on the newConn and connect functions. That's only necessary if
applications or middleware (like cloud haskell) want to specify these
options differently for different connections in the same app context.
Otherwise it'd be possible to set these options globally for the whole
transport at the point at which the transport is initialised.

If we do provide them, we'll have to be clear about if and how they
affect the meaning. For example I expect that the buffered vs eager
can't actually change the semantics and that it's just a hint which the
transport implementation may or may not heed.


If we make it possible, everyone will be happy.  It's amazing that Erlang normally gets by without changing such parameters.  However, the core Erlang developers I met told me that their messaging had been used almost exclusively in relatively small local area network clusters.  The problems come from large clusters (one would have thought that Twitter or Facebook have seen some with Erlang) or wide area, and I could go on for hours describing the scars and difficulties I've encountered, and usually overcome, with all these crazy parameters.
 
Over to the type specialists to suggest how we do this.

Peter

Duncan Coutts

unread,
Nov 7, 2011, 1:00:49 PM11/7/11
to Peter Braam, parallel...@googlegroups.com
On Thu, 2011-11-03 at 12:56 -0600, Peter Braam wrote:
> Hi Duncan -
>
> You beat me to writing it up! I have quite a few comments. I first
> inserted them into your message, but given its length I decided it was
> better to keep them together.

Thanks for the comments.

BTW, I should make clear that I'll be putting this up on a wiki page and
I'll incorporate changes from this discussion.

So I take it you'd like this discussion on the meaning of transports,
endpoints / addresses etc to be included.

> TRANSPORTS

[...] about 'fat' transport libs

> - there are enormous libraries of functionality available (MPI is the best
> example - it has 100's of calls that are not directly transport related).
> Unless re-writing those libraries from scratch is desirable, it's nice to
> have access to all of it.
> - the failure semantics are very different from that of the underlying low
> level transport. E.g. one failure takes an entire MPI process group down
> (normally all nodes running a job, refinements to this are coming but not
> yet stable), most RPC protocols have "their own" timeouts
>
> In my experience the different failure semantics make it extremely
> difficult to put a "fat" protocol in place of a thin one merely for the
> purpose of moving packets. If you use a fat one, you want to be using the
> library of additional routines.

So what is your conclusion here? That it doesn't make sense to implement
this transport API on top of "fat" libs like MPI? In other words, that
if one wants to use MPI they should use the binding directly, which
provides all the functionality.

> There are some exceptions, like IPSEC tunnels etc, but on the whole the
> security semantics makes at least the connection API quite different
> involving interaction with key / certificate management systems, and in
> many cases (GSS-API) every packet needs special purpose processing, that
> can be application dependent. I think that is best seen as an application
> leveraging the primitives we introduce.

Yeah, I've no idea at the moment if we could make this transport
interface fit over protocols with security. The problem is that for a
generic interface we can't easily expose special purpose functionality.

With this transport design we can pass in any special configuration,
settings etc when we initialise the transport, but not subsequently when
using it, ie when creating and using connections.

> > The basic thing we want is point-to-point send & receive of variable
> > size blobs of bytes. We will put off consideration of other services
> > like broadcast.

> Unfortunately I think you can't put this off, but maybe the definitions you
> give here are good enough to capture it. A scalable client server
> implementation will insist on many (clients) to one (server) communication
> and want a single request buffer. The same holds in reverse for multicast
> to groups of hosts.

I was a bit unclear. I meant individual messages are sent
point-to-point, single sender and single receiver (ie not multicast).
But of course the address corresponding to the receiving end can be
copied across the network so multiple senders can send messages to the
same receiver. So that covers the many to one case but it doesn't cover
the multicast case of one to many.

It's multicast that I've put off, mainly because from my point of view
it looks pretty tricky and it's an area of greatest difference between
underlying transports. In particular it'd require that we support
unordered, unreliable connections with limited message size (since
that's all that multicast on most transports supports) and we've not
decided on that yet.

How important is multicast in our context? What does it get used for? I
know in IP applications like media streaming it's useful, but I can't
immediately think of use cases in our context.

> We need some notion of network endpoint and/or address.
>
>
> ENDPOINTS / ADDRESSES
>
> Yes, and I think we agreed that these are two very different concepts.
> The endpoint as I understand it is a networking level object. The address
> is a resource name. Connections are networking level objects (that tie two
> endpoints). See my discussion below on resources.

This touches on the discussion Simon and I were having about SendEnd vs
SendAddr and deserialisation.


> REVIEW OF CURRENT PROGRAMMING PRACTICES IN C
>
> In order to help verifying the types under discussion I reviewed what is
> currently done in C applications. The sequence of setting up
> communications is as follows:

So of course the interesting thing in comparing with the standard C APIs
is that they are working with a model with "well known" addresses, like
dns name + port 80. Our current design is more like the unix pipe()
function that creates two ends of an anonymous pipe, but where we can
move the sending end of the pipe anywhere in the network. So there's no
listening on well known addresses for incoming connections etc.

The anonymous pipe model is nice and simple, but it does rely on
existing connections to be able to pass the sending end of the pipe
around.

We have not yet adequately addressed the issue of establishing
connections in the first place. From the point of view of a cloud
Haskell app this is ok because we've said that the backend just magics
up some initial peers. Of course those cloud Haskell backends have to be
able to get their initial connections from somewhere, even if it's
transport-specific.

So if we did keep it transport-specific it'd probably look something
like:

Network.Transport.IP

which exports the

initIpTransport :: Config -> IO Transport

Remember, the Transport is the portable interface with the newChan,
send, receive etc.

Then it'd also export a few specific functions that let us create send
and receive ends for "well known" addresses, rather than just the
newChan which makes anonymous pipes.
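
A rough sketch of what such a module could export; the function names
and Config fields below are guesses for illustration only:

-- Illustrative only; assumes the portable Transport, SendEnd and
-- ReceiveEnd types are exported by a Network.Transport module.
module Network.Transport.IP
  ( Config(..)
  , initIpTransport
  , listenAt    -- make a ReceiveEnd at a well-known host:port
  , connectTo   -- make a SendEnd to a well-known host:port
  ) where

import Network.Transport (Transport, SendEnd, ReceiveEnd)

data Config = Config
  { bindHost :: String   -- e.g. "0.0.0.0"
  , bindPort :: Int
  }

initIpTransport :: Config -> IO Transport
initIpTransport = undefined   -- would set up sockets, buffers, etc.

-- Besides the anonymous-pipe style newConn on Transport, export
-- well-known-address variants so initial connections can be made.
listenAt :: Transport -> Int -> IO ReceiveEnd
listenAt = undefined

connectTo :: Transport -> String -> Int -> IO SendEnd
connectTo = undefined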


> I want to mention that the terminology "send endpoint" and "receive
> endpoint" is not one we should adopt because they are ambiguous (many
> people think the "endpoint" is remote). Better words are "source" (where
> packets depart) and "dest" (where packets arrive). (Even this naming is
> "polluted" because when source is used in conjunction with sink it has the
> opposite meaning.) The confusion is best illustrated by Duncan's wording
> below (which I believe is the opposite of mine, by accident.)

Yeah, I don't like the terminology I've been using. There are so many
slightly misleading words: pipe, connection, channel, port. The remote
package uses SendPort and ReceivePort. During the discussion of the
transport layer I didn't want to confuse things by reusing these names.

Suggestions?

> The following patterns are common:
>
> 1. a small message is sent directly over the connection (sometimes
> delivered with an event, sometimes inside an event)
>
> 2. a small packet is used to communicate a DMA address in the memory of the
> remote side of the connection. This remote address can always be used to
> send data to (using a local DMA address) and by some API's (more efficient)
> to read from. The C-api is:
>
> dma_send(source_side_addr, dest_side_addr, buflen, buffer *)
>
> 3. a vectorized (so called iov) version of (2) where multiple source
> buffers are registered for DMA and transported to the destination.
>
> Transports (2) and (3) go at bus speeds and involve no CPU on advanced
> networks. The buffers we send with DMA are typically not serialized (and
> copying or packaging in any form is to be avoided). Multiple DMA's may be
> active over an endpoint simultaneously (this is called multiplexing).

I've been assuming that we can avoid having the transport layer's api
expose anything about thresholds between small messages and large
messages that use DMA. So a CCI transport impl would send small messages
in its usual way and for bigger messages, it'd decide based on best
performance if it should send a sequence of small messages or set up a
DMA transfer.

> Can the functions above get extra parameters for the iov addresses?
> Not having iov capability is a nono.

I'm happy with us providing a sendVec that uses a list of data chunks.
On the receive side our current design puts the transport layer in
charge of buffer allocation, we just get back the ByteString. If it
makes sense to handle vectored receive then we'd want to have a receive
that gives the app a list of ByteString (or a lazy ByteString which is
essentially the same thing).

As I mentioned in another reply, sending a vector of chunks makes a lot
of sense given our binary serialisation libraries which produce output
in large chunks, rather than a single massive chunk. For example the new
ByteString builder monoid has some fairly clever threshold stuff so that
small chunks of data are copied into the output while large blobs of
data are linked rather than copied (the output being a list of buffers
-- a lazy ByteString).
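
A possible shape for the vectored variants, sketched with assumed names
(and leaving open whether receive returns a list of strict chunks or a
lazy ByteString):

import qualified Data.ByteString      as BS
import qualified Data.ByteString.Lazy as BL
import Network.Transport (SendEnd, ReceiveEnd)   -- assumed, as sketched earlier

-- Vectored send: the transport sees every chunk of one logical message
-- and can decide for itself how to batch, copy or DMA them.
sendVec :: SendEnd -> [BS.ByteString] -> IO ()
sendVec = undefined

-- A builder-produced lazy ByteString is already a list of chunks, so it
-- can be handed over without flattening it into one big buffer first.
sendLazy :: SendEnd -> BL.ByteString -> IO ()
sendLazy end = sendVec end . BL.toChunks

-- Vectored receive: hand back the buffers the transport filled.
receiveVec :: ReceiveEnd -> IO [BS.ByteString]
receiveVec = undefined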

> Secondly, in C api's typically all results are communicated through events.
> Waiting for events is done in two ways (1) spinning and (2) sleeping. In
> IP networking the event is found in return codes and modified parameters
> (bitmaps in select and poll for example). In advanced communication
> libraries event packages are placed in buffers by the networking hardware.
> Sleeping is very very harmful to performance, but spinning is not used
> much outside of HPC to my knowledge. The key harm caused by sleeping lies
> in the fact that upon arrival of the event the scheduler may run the
> waiting process on a different CPU blowing all the caches. A key
> question is if the Haskell thread implementation can avoid this harm? If
> it can't, the blocking nature of the send and receive functions discussed
> above will prevent us from multiplexing communications over one connection
> appropriately (a nono).

The Haskell thread scheduler doesn't move threads between cores very
aggressively, and it's possible to pin them.
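
For example, with GHC's threaded runtime a thread can be created on a
fixed capability so a receive loop stays put; a tiny illustration
(receiveLoop here is just a stand-in for the real loop):

import Control.Concurrent (forkOn, getNumCapabilities)

-- Run the receive loop on a fixed capability so the scheduler does not
-- migrate it between cores while it waits for messages.
main :: IO ()
main = do
  n <- getNumCapabilities
  _ <- forkOn (n - 1) receiveLoop
  return ()

receiveLoop :: IO ()
receiveLoop = return ()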

> RESOURCES
>
> Finally I want to come back to the issue of resources. Today it is
> probably still true that some kind of process running on a node is
> responsible for communications, although MPI has replaced that with a
> communicator which is a set of nodes that is communicating (and fails as a
> group). This is about to come to an end, with communications to GPU's,
> FPGA's and switches becoming first class citizens and with resilience
> requiring possibly transparent migration of such entities.
>
> Above I think I touched on the data constructors for endpoints. The
> question here is if endpoints are standalone or always associated with a
> Haskell execution environment? I think we need to be carefully thinking if
> that is compatible with the paradigms / examples we have begun to discuss.
> If we want to use pure TCP sockets to write a web client, we will be
> sending messages to non-Haskell programs - I think we want send and receive
> (with a raw TCP transport) to be useful for that?

I don't think we should be specially targeting interoperability with
other protocols. For example I don't see any point in trying to make an
IP HTTP backend that can work with standard servers out there on the
web. If you want to do that, use an existing Haskell HTTP lib.

So I'd say that each transport is really defining its own protocol. If
that happens to match something that is externally meaningful then
that's a bonus but I don't think we should be aiming at that.

> Secondly the issue of strings as addresses isn't sitting well with me.
> Different transports will accept different types of addresses, and it's
> merely the fact that we are dealing with the C department that they are
> currently always packed in a string or integer perhaps. Most transports
> send these addresses through a naming service that translates them into
> some other format.

In our current design the addresses are completely abstract. We use
newConn to create both ends of a pipe, one end has an associated address
that we can move around and use to make a connection. The only thing we
require is that addresses are serialisable.

If we have transport-specific functions for making connections for
well-known addresses then those could use domain-specific types for the
address.

> An application framework (MPI, Hadoop, Erlang-like, Cloud-Haskell) appears
> to sometimes have their own address space (MPI rank - the process number -
> being a good example). I know that CCI is using a URI based addressing,
> and discussing a naming service that can accommodate URI's for the
> application frameworks they want to support (e.g. job schedulers reporting
> node and process id).
>
> Ryan brought up this issue also, I think that we should be careful to stuff
> the minimum into the transport API's and leave the remainder to a different
> piece of software that can communicate to us things that particular
> frameworks want. I'm not quite sure what I'm saying here in terms of types.
>
> I hope this is helpful - it is clearly only a part of the puzzle.

I think we'll need to get a bit more concrete. I'm not sure I see all
the consequences of your comments yet.

I'm trying to make things a bit more concrete with my code for a
prototype transport interface, an implementation, and a trivial
cloud-haskell implemented on top.

Duncan Coutts

unread,
Nov 7, 2011, 1:04:04 PM11/7/11
to rrne...@gmail.com, parallel...@googlegroups.com
On Fri, 2011-11-04 at 00:43 -0400, Ryan Newton wrote:

> > Or maybe you mean sending a single logical message but not necessarily
> > requiring the whole sequence of chunks to be present in memory at once
> > (as in traditional vectored send) but building up a single logical
> > message piece by piece. Would that have any network performance
> > advantages over doing things at the message level and sending a stream
> > using multiple logical messages?
>
> Yep, if I send a single 500MB lazy Bytestring as logical send, it is
> probably better to send the smaller pieces as the chunks become
> available.

Absolutely, the question is simply at what layer of the stack you do
that. Given the ability to send messages you can add sequences at the
app level. Is there any benefit to pushing it down into the transport
layer?

Duncan Coutts

unread,
Nov 7, 2011, 1:28:20 PM11/7/11
to Peter Braam, parallel...@googlegroups.com

Right, but when does the app or middleware want to specify these
parameters? Can we get away with having them specified (programmatically)
when the transport session/environment is initialised or is it essential
that middleware like cloud haskell be able to set these parameters
differently for different connections in the same transport
session/environment?

If it's the latter then we have a problem with being able to provide a
single generic interface that covers multiple transports because
presumably each transport has its own set of special configuration
items.

If it's the former then it's all fine because this can be set when
initialising the transport, and initialisation is transport specific.

> A great but simplified overview of this is on an IBM page (which repeats my
> explanation, in their case for sockets)
> http://www.ibm.com/developerworks/aix/library/au-tcpsystemcalls/
>
> For high performance connections (probably including ethernet card with
> offload facilities) there is a legion of parameters, and while it may seem
> that they are exotic, they are definitely not unusual if you need to max
> out performances. For example large scale services reserve considerable
> amounts of memory to avoid running out of buffers (there is often no spare
> memory to allocating it dynamically is not always an option). All of those
> things become parameters. Uugh, I know.

> In other words, I fear we want, as dirty as it may look, more parameters.

There's no problem with a plethora of parameters as long as we don't
have to specify them separately for each connection. If we need separate
parameters on each connection we should see if we can't restrict those
to "hints" or things that have a common meaning on all transports.

> > So with that point of view, it's really newConn* and not connect that
> > should determine the kind of connection. It probably does not make sense
> > for one end of a connection to think it's reliable while the other end
> > thinks it unreliable.

> Actually some parameters like the MTU are negotiated upon connection time
> by the lower layers, but in some API's passed in.

That doesn't immediately sound like a problem.


> > On the other hand, whether the sending end wants to buffer or send
> > messages eagerly is a local decision. So that could be set at connect
> > time.
> >
> > Note that it's not necessarily essential that we provide these various
> > options on the newConn and connect functions. That's only necessary if
> > applications or middleware (like cloud haskell) want to specify these
> > options differently for different connections in the same app context.
> > Otherwise it'd be possible to set these options globally for the whole
> > transport at the point at which the transport is initialised.
> >
> > If we do provide them, we'll have to be clear about if and how they
> > affect the meaning. For example I expect that the buffered vs eager
> > can't actually change the semantics and that it's just a hint which the
> > transport implementation may or may not heed.
> >
> >
> If we make it possible, everyone will be happy. It's amazing that Erlang
> normally gets by without changing such parameters.

Ah, now that's very interesting to hear. Sounds like pretty good
evidence that it doesn't need to be tweaked for each connection, and
that tweaking parameters for all the connections in the session would be
sufficient.

Do you see what I mean about the distinction between whole session and
per-connection?

It's about whether we have:

initIpTransport :: LotsOfParameters -> IO Transport

or if it is essential that we have

newConn :: LotsOfParameters -> IO (SendAddr, ReceiveEnd)

In the first case the parameters are inherited for all connections
created in that Transport session. In the latter, the parameters are
specified for each connection independently.
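
To illustrate the first option, the many tuning knobs could live in one
per-transport record with defaults, set once at initialisation; the
field names below are invented for the example:

import Network.Transport (Transport)   -- the portable interface type, assumed

-- Hypothetical session-wide parameters for an IP transport, set once at
-- initialisation and inherited by every connection in the session.
data IpTransportParams = IpTransportParams
  { recvBufferBytes :: Int    -- cf. SO_RCVBUF
  , sendBufferBytes :: Int    -- cf. SO_SNDBUF
  , noDelay         :: Bool   -- cf. TCP_NODELAY
  , listenBacklog   :: Int
  }

defaultIpTransportParams :: IpTransportParams
defaultIpTransportParams = IpTransportParams
  { recvBufferBytes = 256 * 1024
  , sendBufferBytes = 256 * 1024
  , noDelay         = True
  , listenBacklog   = 128
  }

initIpTransport :: IpTransportParams -> IO Transport
initIpTransport = undefined   -- every connection inherits these settings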

> However, the core Erlang developers I met told me that their messaging
> had been almost exclusively used in relatively small local area
> network clusters. The problems come from large clusters (one would
> have thought that Twitter or Facebook have seen some with Erlang) or
> wide area, and I could go on for hours describing scars and
> difficulties I've encountered, and usually overcome with all these
> crazy parameters.

Which is fine, if we only have to specify these parameters once for the
whole lot, to tune the middleware to the network environment.

Patrick Maier

unread,
Nov 8, 2011, 6:40:47 AM11/8/11
to parallel...@googlegroups.com
Hi all,

I did implement a simple MPI transport layer with a design similar to Duncan's earlier this year. The differences are: no explicit connection setup (because I connect all peers to all others when MPI initialises - the simplest choice, but it does not scale), and I am using lazy bytestrings as Ryan suggested.

I very much agree with Ryan that a lazy ByteString interface is what we want because it fits nicely with Haskell's semantics, and it even lets us treat streams as single messages. However, the downside of pushing this behaviour into the transport layer is that space can blow up (on the receiving node) if the receiver is slow to consume the stream, which can easily happen if it is overwhelmed with messages from different nodes.

Ultimately, reliable streaming requires some form of flow control (like in TCP), and that requires a bi-directional connection (in the transport library, at least). I would take that as an indication that the basic primitives should be send and receive for strict bytestrings, and that streaming of lazy bytestrings should be implemented on top (still within the transport layer). Applications should be free to choose whether to transmit strict or lazy bytestrings.

Cheers,
   Patrick
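
A rough sketch of that layering, with strict sendChunk/receiveChunk
assumed as the transport primitives and an empty chunk as an (assumed)
end-of-stream marker; a real version would also need the flow control
Patrick mentions:

import qualified Data.ByteString      as BS
import qualified Data.ByteString.Lazy as BL
import System.IO.Unsafe (unsafeInterleaveIO)
import Network.Transport (SendEnd, ReceiveEnd)   -- assumed, as sketched earlier

-- Assumed strict primitives provided by the transport layer.
sendChunk :: SendEnd -> BS.ByteString -> IO ()
sendChunk = undefined

receiveChunk :: ReceiveEnd -> IO BS.ByteString
receiveChunk = undefined

-- Stream a lazy ByteString as a sequence of strict chunks, using an
-- empty chunk as an end-of-stream marker (one possible convention).
sendStream :: SendEnd -> BL.ByteString -> IO ()
sendStream end lbs = do
  mapM_ (sendChunk end) (BL.toChunks lbs)
  sendChunk end BS.empty

-- Reassemble lazily on the receiving side: chunks are pulled from the
-- network only as the consumer demands them.
receiveStream :: ReceiveEnd -> IO BL.ByteString
receiveStream end = fmap BL.fromChunks go
  where
    go :: IO [BS.ByteString]
    go = unsafeInterleaveIO $ do
      chunk <- receiveChunk end
      if BS.null chunk
        then return []
        else fmap (chunk :) go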

Peter Braam

unread,
Nov 8, 2011, 12:53:58 PM11/8/11
to Duncan Coutts, parallel...@googlegroups.com
Hi Duncan -

Thanks for these thoughts.  I don't see (m)any points where we really diverge, possibly the only exception being extra parameters and the concept of binding.  More below....

On Mon, Nov 7, 2011 at 11:00 AM, Duncan Coutts <dun...@well-typed.com> wrote:
On Thu, 2011-11-03 at 12:56 -0600, Peter Braam wrote:
> Hi Duncan -
>
> You beat me to writing it up!   I have quite a few comments.  I first
> inserted them into your message, but given its length I decided it was
> better to keep them together.

Thanks for the comments.

BTW, I should make clear that I'll be putting this up on a wiki page and
I'll incorporate changes from this discussion.

 

So I take it you'd like this discussion on the meaning of transports,
endpoints / addresses etc to be included.

> TRANSPORTS

[...] about 'fat' transport libs

> - there are enormous libraries of functionality available (MPI is the best
> example - it has 100's of calls that are not directly transport related).
>  Unless re-writing those libraries from scratch is desirable, it's nice to
> have access to all of it.
> - the failure semantics are very different from that of the underlying low
> level transport.  E.g. one failure takes and entire MPI process group down
> (normally all nodes running a job, refinements to this are coming but not
> yet stable), most RPC protocols have "their own" timeouts
>
> In my experience the different failure semantics make it extremely
> difficult to put a "fat" protocol in place of a thin one merely for the
> purpose of moving packets.  If you use a fat one, you want to be using the
> library of additional routines.

So what is your conclusion here? That it doesn't make sense to implement
this transport API on top of "fat" libs like MPI? In other words, that
if one wants to use MPI that they should use the binding directly which
provides all the functionality.

I think so.  It is however possible that "you", meaning the Haskell community, are vastly better at describing wildly varying semantics than the "C department".  I can speak for the latter, and it is imho just too difficult to make MPI communications behave like a lower level package would.


> There are some exceptions, like IPSEC tunnels etc, but on the whole the
> security semantics makes at least the connection API quite different
> involving interaction with key / certificate management systems, and in
> many cases (GSS-API) every packet needs special purpose processing, that
> can be application dependent.  I think that is best seen as an application
> leveraging the primitives we introduce.

Yeah, I've no idea at the moment if we could make fit this transport
interface over the protocols with security. The problem is that for a
generic interface we can't easily expose special purpose functionality.

With this transport design we can pass in any special configuration,
settings etc when we initialise the transport, but not subsequently when
using it, ie when creating and using connections.

That's probably not good enough.  The reason is that one transport may have for example a local and wide area IP network component.  And options on the wide area sockets are likely to be different than what one needs locally.

Also, have you considered how "bind" has been buried in this?


> > The basic thing we want is point-to-point send & receive of variable
> > size blobs of bytes. We will put off consideration of other services
> > like broadcast.

> Unfortunately I think you can't put this off, but maybe the definitions you
> give here are good enough to capture it.  A scalable client server
> implementation will insist on many (clients) to one (server) communication
> and want a single request buffer.  The same holds in reverse for multicast
> to groups of hosts.

I was a bit unclear. I meant individual messages are sent
point-to-point, single sender and single receiver (ie not multicast).
But of course the address corresponding to the receiving end can be
copied across the network so multiple senders can send messages to the
same receiver.

The point is that an address does NOT name an endpoint.  Only after it's "bound" using the bind calls in the various libraries does this become so. 

For example, one client server design may "accept" connections and create a new endpoint for each client (in particular with separate buffers).  Another design may share an endpoint among all clients.   The former looks cleaner, but there are reasons to make both possible.

So that covers the many to one case but it doesn't cover
the multicast case of one to many.

It's multicast that I've put off, mainly because from my point of view
it looks pretty tricky and its an area of greatest difference between
underlying transports. In particular it'd require that we support
unordered, unreliable connections with limited message size (since
that's all that multicast on most transports supports) and we've not
decided on that yet.

How important is multicast in our context? What does it get used for? I
know in IP applications like media streaming it's useful, but I can't
immediately think of use cases in our context.


Almost all HPC programs have (sometimes very many) global barriers which benefit significantly from broadcast when the barrier is passed.

> We need some notion of network endpoint and/or address.
>
>
> ENDPOINTS / ADDRESSES
>
> Yes, and I think we agreed that these are two very different concepts.
> The endpoint as I understand it is a networking level object.  The address
> is a resource name.  Connections are networking level objects (that tie two
> endpoints).  See my discussion below on resources.

This touches on the discussion Simon and I were having about SendEnd vs
SendAddr and deserialisation.


> REVIEW OF CURRENT PROGRAMMING PRACTICES IN C
>
> In order to help verifying the types under discussion I reviewed what is
> currently done in C applications.  The sequence of setting up
> communications is as follows:

So of course the interesting thing in comparing with the standard C APIs
is that they are working with a model with "well known" addresses, like
dns name + port 80.

No - see below.  MPI, utilizing a job scheduler, allows address spaces like "communicate with job rank 50 in the current batch process" - nobody except the job scheduler would even know on what machine this might be running.
 
Our current design is more like the unix pipe()
function that creates two ends of an anonymous pipe, but where we can
move the sending end of the pipe anywhere in the network. So there's no
listening on well known addresses for incoming connections etc.

Interesting.  Can you write a small client server program, like I did below for me?
Source and Destination
 

> The following patterns are common:
>
> 1. a small message is sent directly over the connection (sometimes
> delivered with an event, sometimes inside an event)
>
> 2. a small packet is used to communicate a DMA address in the memory of the
> remote side of the connection. This remote address can always be used to
> send data to (using a local DMA address) and by some API's (more efficient)
> to read from.  The C-api is:
>
> dma_send(source_side_addr, dest_side_addr, buflen, buffer *)
>
> 3. a vectorized (so called iov) version of (2) where multiple source
> buffers are registered for DMA and transported to the destination.
>
> Transports (2) and (3) go at bus speeds and involve no CPU on advanced
> networks.   The buffers we send with DMA are typically not serialized (and
> copying or packaging in any form is to be avoided).  Multiple DMA's may be
> active over an endpoint simultaneously (this is called multiplexing).

I've been assuming that we can avoid having the transport layer's api
expose anything about thresholds between small messages and large
messages that use DMA. So a CCI transport impl would send small messages
in its usual way and for bigger messages, it'd decide based on best
performance if it should send a sequence of small messages or set up a
DMA transfer.

Yes, fully agreed.
 

> Can the functions above get extra parameters for the iov addresses?
> Not having iov capability is a nono.

I'm happy with us providing a sendVec that uses a list of data chunks.
On the receive side our current design puts the transport layer in
charge of buffer allocation, we just get back the ByteString.

If the transport layer has to use e.g. a kernel level memory management system it will get non-contiguous pages.  In kernel space, and for applications managing their own page pools, having contiguous larger buffers is not common (and often impossible).  So ... I think the suggestion you make below makes sense.
As long as we can spin on waiting for data, we'll be good.  Motivation: fast networks have latencies of 300ns at the moment, which is only about 3x the memory access time.  On such a network waiting for a small message (e.g. in shared memory cases) by spinning is hundreds of times cheaper than re-scheduling a thread (I'd believe it if you told me that Haskell threads are 10x faster, so I put hundreds, to stay ahead of you :)    Mind you, the Coda file server on which I worked for many years also had lightweight threads, as do many libraries now.)


> RESOURCES
>
> Finally I want to come back to the issue of resources.  Today it is
> probably still true that some kind of process running on a node is
> responsible for communications, although MPI has replaced that with a
> communicator which is a set of nodes that is communicating (and fails as a
> group).  This is about to come to an end, with communications to GPU's,
> FPGA's and switches becoming first class citizens and with resilience
> requiring possibly transparent migration of such entities.
>
> Above I think I touched on the data constructors for endpoints.  The
> question here is if endpoints are standalone or always associated with a
> Haskell execution environment?  I think we need to be carefully thinking if
> that is compatible with the paradigms / examples we have begun to discuss.
>  If we want to use pure TCP sockets to write a web client, we will be
> sending messages to non-Haskell programs - I think we want send and receive
> (with a raw TCP transport) to be useful for that?

I don't think we should be specially targeting interoperability with
other protocols. For example I don't see any point in trying to make an
IP HTTP backend that can work with standard servers out there on the
web. If you want to do that, use an existing Haskell HTTP lib.

OK.  But, do we then agree where we stop?  MPI and ssh were mentioned ... so are we, and I strongly favor this, delegating these to a set of networking libraries that are beyond transports?


So I'd say that each transport is really defining it's own protocol. If
that happens to match something that is externally meaningful then
that's a bonus but I don't think we should be aiming at that.

> Secondly the issue of strings as addresses isn't sitting well with me.
>  Different transports will accept different types of addresses, and it's
> merely the fact that we are dealing with the C department that they are
> currently always packed in a string or integer perhaps.   Most transports
> send these addresses through a naming service that translates them into
> some other format.

In our current design the addresses are completely abstract. We use
newConn to create both ends of a pipe, one end has an associated address
that we can move around and use to make a connection. The only thing we
require is that addresses are serialisable.

If we have transport-specific functions for making connections for
well-known addresses then those could use domain-specific types for the
address.

That's what I would like.
 

> An application framework (MPI, Hadoop, Erlang-like, Cloud-Haskell) appears
> to sometimes have their own address space (MPI rank - the process number -
> being a good example).  I know that CCI is using a URI based addressing,
> and discussing a naming service that can accomodate URI's for the
> application frameworks they want to support (e.g. job schedulers reporting
> node and process id).
>
> Ryan brought up this issue also, I think that we should be careful to stuff
> the minimum into the transport API's and leave the remainder to a different
> piece of software that can communicate to us things that particular
> frameworks want.  I'm not quite sure what I'm saying here in terms of types.
>
> I hope this is helpful - it is clearly only a part of the puzzle.

I think we'll need to get a bit more concrete. I'm not sure I see all
the consequences of your comments yet.

I'm trying to make things a bit more concrete this with my code for a
prototype transport interface, impl, and trivial cloud-haskell
implemented on top.


In the coming months we will do CCI as a very low level transport.  BTW, the "application Peter has in mind" is nothing new - I'd like to extend DPH parallel arrays into the network, in due course, using a very fast transport.

Peter

Peter Braam

unread,
Nov 8, 2011, 1:12:19 PM11/8/11
to Duncan Coutts, parallel...@googlegroups.com
Some are probably OK at the transport level.  Here is one that isn't: some "bind"-like calls indicate how much buffer space needs to be available - usually lots for servers, little for clients.  The issue is that many programs include both client and server functionality, so they want at least two kinds of endpoints (and if they are event driven, those are not even bound to specific threads).

 
If it's the latter then we have a problem with being able to provide a
single generic interface that covers multiple transports because
presumably each transport has its own set of special configuration
items.

Ouch, yes.
 

If it's the former then it's all fine because this can be set when
initialising the transport, and initialisation is transport specific.

> A great but simplified overview of this is on an IBM page (which repeats my
> explanation, in their case for sockets)
> http://www.ibm.com/developerworks/aix/library/au-tcpsystemcalls/
>
> For high performance connections (probably including ethernet card with
> offload facilities) there is a legion of parameters, and while it may seem
> that they are exotic, they are definitely not unusual if you need to max
> out performances.  For example large scale services reserve considerable
> amounts of memory to avoid running out of buffers (there is often no spare
> memory to allocating it dynamically is not always an option).  All of those
> things become parameters. Uugh, I know.

> In other words, I fear we want, as dirty as it may look, more parameters.

There's no problem with a plethora of parameters as long as we don't
have to specify them separately for each connection. If we need separate
parameters on each connection we should see if we can't restrict those
to "hints" or things that have a common meaning on all transports.


If an instance of the Transport type provides a bind function, can the type of the parameters for that bind function be different for each instance of Transport?  If yes, we might be ok.  What I think is not so likely to happen is that people will expect to run their applications against different transports by just making a one-line change.
Nope, because I don't think (as I learned from some of the core Erlang maintainers) that there is enough variety in Erlang network applications at the moment.
 

Do you see what I mean about the distinction between whole session and
per-connection?

I do.  But I've written many apps for which it wouldn't work (the simultaneous client / server functionality being one).


It's about whether we have:

initIpTransport :: LotsOfParameters -> IO Transport

or if it is essential that we have

newConn :: LotsOfParameters -> IO (SendAddr, ReceiveEnd)

In the first case the parameters are inherited for all connections
created in that Transport session. In the latter, the parameters are
specified for each connection independently.

> However, the core Erlang developers I met told me that their messaging
> had been almost exclusively used in relatively small local area
> network clusters.  The problems come from large clusters (one would
> have thought that Twitter or Facebook have seen some with Erlang) or
> wide area, and I could go on for hours describing scars and
> difficulties I've encountered, and usually overcome with all these
> crazy parameters.

Which is fine, if we only have to specify these parameters once for the
whole lot, to tune the middleware to the network environment.


I'm beating a dead horse by now, but help me to set up endpoints with big and with small buffers within the same application :)  ..., ... please.

Peter
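
One hypothetical way to accommodate that, sketched purely for
illustration: keep session-wide parameters but let endpoint creation
take optional hints that a transport may honour, so a single application
can have a big-buffered server endpoint and small-buffered client
endpoints in the same session.  None of these names are part of the
proposed interface.

import Network.Transport (Transport, SendAddr, ReceiveEnd)  -- assumed, as sketched earlier

-- Hypothetical per-endpoint hints layered over the session-wide
-- defaults; all names here are invented for the example.
data EndpointHints = EndpointHints
  { hintRecvBufferBytes :: Maybe Int   -- Nothing = use the session default
  , hintSendBufferBytes :: Maybe Int
  }

defaultHints :: EndpointHints
defaultHints = EndpointHints Nothing Nothing

-- Create an endpoint pair with hints; a transport is free to ignore
-- hints it cannot honour.
newConnWithHints :: Transport -> EndpointHints -> IO (SendAddr, ReceiveEnd)
newConnWithHints = undefined

-- Big buffers for a server-side endpoint, defaults for a client-side
-- one, within the same application and the same transport session.
example :: Transport -> IO ()
example t = do
  _server <- newConnWithHints t
               defaultHints { hintRecvBufferBytes = Just (8 * 1024 * 1024) }
  _client <- newConnWithHints t defaultHints
  return ()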