Different behavior depending on whether a process is spawned locally or on a different node


Yorick Laupa

Dec 22, 2014, 12:03:50 PM
to cloud-haskel...@googlegroups.com
Hi everyone!

We are seeing some really strange behavior with distributed-process. We put together a contrived project that reproduces it (see the mail attachment).

In a few words: a given Process does not behave the same way depending on whether it is spawned locally or on a new node.

Here's our simple use-case:

Considering 3 entities:
  1. Controller (main thread)
  2. Queue
  3. Forwarder
There are 2 situations:
  1. The Forwarder and the Controller share the same node
  2. The Forwarder and the Controller have different nodes

Forwarder is the process whose behavior is inconsistent (our guess).

Notes on the scenario:

The Forwarder knows Controller's ProcessId

Controller spawns the Queue locally. Controller sends a Forwarder message to Queue, containing the ProcessId of Forwarder.

Once Queue receives a Forwarder message, it sends a Forward message to Forwarder.

When Forwarder receives a Forward message, it sends a Forwarded message to Controller.
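
For reference, here is a minimal sketch of that flow (the real code is in the attached project; the definitions below, in particular the queue helper, are illustrative rather than copied from the attachment):

{-# LANGUAGE DeriveGeneric, DeriveDataTypeable #-}

import Control.Distributed.Process
import Data.Binary (Binary)
import Data.Typeable (Typeable)
import GHC.Generics (Generic)

newtype Forwarder = Forwarder ProcessId deriving (Generic, Typeable)
data    Forward   = Forward             deriving (Generic, Typeable)
data    Forwarded = Forwarded           deriving (Generic, Typeable)

instance Binary Forwarder
instance Binary Forward
instance Binary Forwarded

-- Queue: waits for a Forwarder message, then asks that process to forward.
queue :: Process ()
queue = do
  Forwarder fwd <- expect
  send fwd Forward

-- Forwarder: waits for Forward, then reports back to Controller.
forwarder :: ProcessId -> Process ()
forwarder controllerPid = do
  Forward <- expect
  send controllerPid Forwarded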

--

In the end, we expect Controller to receive a Forwarded message, regardless of whether Controller and Forwarder share the same node.

Unfortunately, that isn't the case.

When Controller and Forwarder share the same node, it works as expected.

When Controller and Forwarder are not on the same node, it doesn't work. To be more specific, it depends! The team splits into 3 groups with different output:

  • Group 1: It works, always.
  • Group 2: It works on the second run, then always works.
  • Group 3 (including myself): It never works.

By 'doesn't work', we mean Controller never gets a Forwarded message.

Any ideas?

Thanks for your help

Yorick

Personal system:

OS: Archlinux (kernel 3.17.6-1-ARCH)

GHC: 7.8.3

Cabal: 1.20.0.3

dp-issue-0.1.0.0.tar.gz

Tim Watson

Dec 22, 2014, 4:35:18 PM
to cloud-haskel...@googlegroups.com
Ok, so this always seems to fail for me. I'll look into it, though nothing in the code stands out as wrong at first glance.

Have you tried enabling tracing/debugging on both the nodes? That'll be my first port of call when I find time to dig into this properly, though that won't be immediately.


Cheers,
Tim

Tim Watson

Dec 22, 2014, 4:43:01 PM
to cloud-haskel...@googlegroups.com
Ok, here's the output for both nodes in the failing remote case:

Mon Dec 22 21:40:27 UTC 2014 - MxSpawned pid://127.0.0.1:4907:0:5
Mon Dec 22 21:40:27 UTC 2014 - MxRegistered pid://127.0.0.1:4907:0:4 "tracer.initial"
Mon Dec 22 21:40:27 UTC 2014 - MxReceived pid://127.0.0.1:4907:0:5 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\SOtracer.initial\SOH" :: (75cbdbb96e0c2aa2,5cb248d8fce66aa4)
Mon Dec 22 21:40:27 UTC 2014 - MxSpawned pid://127.0.0.1:4907:0:6
Mon Dec 22 21:40:27 UTC 2014 - MxSpawned pid://127.0.0.1:4907:0:7
Mon Dec 22 21:40:27 UTC 2014 - MxProcessDied pid://127.0.0.1:4907:0:5 DiedNormal
Mon Dec 22 21:40:27 UTC 2014 - MxRegistered pid://127.0.0.1:4907:0:6 "mx.table.coordinator"
Mon Dec 22 21:40:27 UTC 2014 - MxReceived pid://127.0.0.1:4907:0:7 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DC4mx.table.coordinator\SOH" :: (75cbdbb96e0c2aa2,5cb248d8fce66aa4)
Mon Dec 22 21:40:27 UTC 2014 - MxSpawned pid://127.0.0.1:4907:0:8
Mon Dec 22 21:40:27 UTC 2014 - MxSpawned pid://127.0.0.1:4907:0:9
Mon Dec 22 21:40:27 UTC 2014 - MxProcessDied pid://127.0.0.1:4907:0:7 DiedNormal
Mon Dec 22 21:40:27 UTC 2014 - MxRegistered pid://127.0.0.1:4907:0:8 "logger"
Mon Dec 22 21:40:27 UTC 2014 - MxReceived pid://127.0.0.1:4907:0:9 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\ACKlogger\SOH" :: (75cbdbb96e0c2aa2,5cb248d8fce66aa4)
Mon Dec 22 21:40:27 UTC 2014 - MxSpawned pid://127.0.0.1:4907:0:10
Mon Dec 22 21:40:27 UTC 2014 - MxProcessDied pid://127.0.0.1:4907:0:9 DiedNormal
Mon Dec 22 21:40:27 UTC 2014 - MxReceived pid://127.0.0.1:4907:0:10 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DLE127.0.0.1:4982:0W\252\251X\NUL\NUL\NUL\n" :: (b2fb9f6be9976d4b,7f7dd7784ec2eb73)
Mon Dec 22 21:40:27 UTC 2014 - MxSent pid://127.0.0.1:4982:0:10 pid://127.0.0.1:4907:0:10 [unencoded message] :: ()
Mon Dec 22 21:40:27 UTC 2014 - MxReceived pid://127.0.0.1:4907:0:10 "\NUL\NUL\NUL\SOH\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DLE127.0.0.1:4982:0W\252\251X\NUL\NUL\NUL\v" :: (b2fb9f6be9976d4b,7f7dd7784ec2eb73)
Mon Dec 22 21:40:27 UTC 2014 - MxSent pid://127.0.0.1:4982:0:11 pid://127.0.0.1:4907:0:10 [unencoded message] :: ()
Mon Dec 22 21:40:27 UTC 2014 - MxSpawned pid://127.0.0.1:4907:0:11
Mon Dec 22 21:40:27 UTC 2014 - MxSent pid://127.0.0.1:4907:0:11 pid://127.0.0.1:4907:0:10 [unencoded message] :: Forwarder
Mon Dec 22 21:40:27 UTC 2014 - MxReceived pid://127.0.0.1:4982:0:4 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DLE127.0.0.1:4982:0W\252\251X\NUL\NUL\NUL\v" :: (588a9aec00193fb1,fbffaa7fe16d20fa)
Mon Dec 22 21:40:27 UTC 2014 - MxReceived pid://127.0.0.1:4982:0:4 "\ETX\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DLE127.0.0.1:4982:0W\252\251X\NUL\NUL\NUL\v\SOH\NUL\NUL\NUL\NUL\NUL\NUL\NULTuser error (Error: Could not resolve closure: Invalid static label 'Main.forwarder')" :: (588a9aec00193fb1,fbffaa7fe16d20fa)
Mon Dec 22 21:40:27 UTC 2014 - MxReceived pid://127.0.0.1:4982:0:4 "\v\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DLE127.0.0.1:4982:0W\252\251X\NUL\NUL\NUL\n" :: (588a9aec00193fb1,fbffaa7fe16d20fa)
Mon Dec 22 21:40:27 UTC 2014 - MxLog " [network] invalid request: 4294968323"
Mon Dec 22 21:40:27 UTC 2014 - MxLog " [network] invalid request: 4294968323"
Mon Dec 22 21:40:27 UTC 2014 - MxReceived pid://127.0.0.1:4907:0:11 [unencoded message] :: Forwarder
Mon Dec 22 21:40:27 UTC 2014 - MxSent pid://127.0.0.1:4982:0:11 pid://127.0.0.1:4907:0:11 [unencoded message] :: Forward
Mon Dec 22 21:40:27 UTC 2014 - MxLog " [network] invalid request: 4294968324"
Mon Dec 22 21:40:27 UTC 2014 - MxProcessDied pid://127.0.0.1:4907:0:11 DiedNormal
Mon Dec 22 21:40:27 UTC 2014 - MxLog " [network] invalid request: 4294968324"
Mon Dec 22 21:40:27 UTC 2014 - MxLog " [network] invalid request: 4294968324"

It looks to me like the remote table isn't configured correctly, hence:

Mon Dec 22 21:40:27 UTC 2014 - MxReceived pid://127.0.0.1:4982:0:4 "\ETX\NUL\NUL\NUL\NUL\NUL\NUL\NUL\DLE127.0.0.1:4982:0W\252\251X\NUL\NUL\NUL\v\SOH\NUL\NUL\NUL\NUL\NUL\NUL\NULTuser error (Error: Could not resolve closure: Invalid static label 'Main.forwarder')" :: (588a9aec00193fb1,fbffaa7fe16d20fa)

Why might that be?
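
For reference, the setup I'd expect looks roughly like this (a sketch from memory; the stand-in forwarder body and the name myRemoteTable are mine, not taken from your project):

{-# LANGUAGE TemplateHaskell #-}
module Main where

import Control.Distributed.Process
import Control.Distributed.Process.Closure (remotable)
import Control.Distributed.Process.Node (initRemoteTable)

-- stand-in body; the important part is that the definition precedes the splice
forwarder :: ProcessId -> Process ()
forwarder controller = do
  () <- expect
  send controller ()

remotable ['forwarder]

-- every node taking part must be created with this composed table, not the
-- bare initRemoteTable, otherwise 'Main.forwarder' cannot be resolved when
-- spawned from another node
myRemoteTable :: RemoteTable
myRemoteTable = Main.__remoteTable initRemoteTable

main :: IO ()
main = return ()  -- the real main builds its nodes with myRemoteTable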

Tim Watson

Dec 22, 2014, 4:45:23 PM
to cloud-haskel...@googlegroups.com
PS: generated by running the executable (with the sameLocalNodeForwarder test case commented out) using the following flags:

DISTRIBUTED_PROCESS_TRACE_FILE=trace.log DISTRIBUTED_PROCESS_TRACE_FLAGS=pdnusrl ./dist/dist-sandbox-c5f385f6/build/dp-issue/dp-issue

Tim Watson

Dec 22, 2014, 4:47:12 PM
to cloud-haskel...@googlegroups.com
Oh, and I also altered one function to relay remote trace events:

newLocalNodeForwarder k = do
    self      <- getSelfPid
    name      <- liftIO generateName
    transport <- liftIO $ newTransport hostname name
    bracket (liftIO $ newLocalNode transport initRemoteTable)
            (liftIO . closeLocalNode) $ \forwarderNode ->
      let forwardedNodeId = localNodeId forwarderNode in do
        startTraceRelay forwardedNodeId            -- <- the line I added
        bracket (spawn forwardedNodeId $ $(mkClosure 'forwarder) self)
                (const $ return ())
                k
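
(For anyone following along: as I understand it, startTraceRelay spawns a relay process on the given remote node that forwards that node's trace events back to the tracer on the calling node, which is why events from both nodes show up interleaved in the trace log I posted earlier.)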

Tim Watson

Dec 22, 2014, 4:50:06 PM
to cloud-haskel...@googlegroups.com
PPS (and last post for tonight): in general, Control.Distributed.Process.Debug is your friend, as is the whole management API - http://hackage.haskell.org/package/distributed-process-0.5.3/docs/Control-Distributed-Process-Management.html - if you need to dig into what goes on in your processes and nodes at a low level of detail.

Cheers,
Tim

Yorick Laupa

Jan 2, 2015, 8:13:22 AM
to cloud-haskel...@googlegroups.com
Thanks Tim and sorry for the delay.

In the meantime, we figured out what was wrong in the 'Real World' project.

Note on that pet project:

You were right: we used different remote tables, which caused an error on the static value lookup. It turned out that the pet project didn't actually reproduce our issue with distributed-process, though.

The real project runs its code inside a container (Docker). The caveat was that, in some environments, we sent a ProcessId carrying a 127.0.0.1 address over the network to the Dockerized application, which led that node to a Process that doesn't exist.
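
For anyone who hits the same thing, the gist of the fix is simply to make sure the node is created with an address its peers can actually reach, since that host name ends up inside every NodeId/ProcessId sent over the wire. A rough sketch, assuming the network-transport-tcp 0.4 createTransport signature (host, port, parameters); the address and port below are made up:

import Control.Distributed.Process (say)
import Control.Distributed.Process.Node (initRemoteTable, newLocalNode, runProcess)
import Network.Transport.TCP (createTransport, defaultTCPParameters)

main :: IO ()
main = do
  -- "10.0.0.5" stands in for the container's externally reachable address;
  -- advertising "127.0.0.1" here is what produced ProcessIds that remote
  -- peers could not resolve
  Right transport <- createTransport "10.0.0.5" "4000" defaultTCPParameters
  node <- newLocalNode transport initRemoteTable
  runProcess node $ say "node is up on a routable address"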

Thanks for your help.

Regards,

Yorick