RPC over channels


tobere...@gmail.com

unread,
Aug 1, 2013, 11:32:02 AM
to clo...@googlegroups.com
With core.async, cljzmq, and now zmq-async, there's an opportunity to link RPC to channels and be free of the underlying transport system.  I'm proposing an RPC library that sends and receives data over channels.

The idea is to have a two-way communication context as a map with a send-channel, receive-channel, serializer, and deserializer.  The user would be responsible for the transport mechanism between the "client" and the "server".
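
To make that concrete, here is a rough sketch of what such a context map might look like (every name here is illustrative, not a settled API):

```clojure
(require '[clojure.core.async :as async])

(defn make-context
  "Sketch of a two-way communication context. The user wires :send-ch
  and :receive-ch to whatever transport they choose (ZeroMQ, a broker,
  an in-process channel pair)."
  [serializer deserializer]
  {:send-ch      (async/chan 10)
   :receive-ch   (async/chan 10)
   :serializer   serializer
   :deserializer deserializer})
```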

When a user makes a request, the library would create a map following the JSON-RPC specification, attach a UUID to it, and then send it down the send-channel.  This UUID could be used in conjunction with pulling items off of the receive-channel to handle responses.
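
For illustration, wrapping a call in the JSON-RPC 2.0 shape might look like this (a sketch only; the real library could choose a different envelope):

```clojure
(defn wrap-request
  "Sketch: wrap a user call in a JSON-RPC 2.0 style map with a fresh
  UUID as its id, ready to go down the send-channel."
  [method params]
  {:jsonrpc "2.0"
   :method  method
   :params  params
   :id      (str (java.util.UUID/randomUUID))})
```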

The data flow would be:
user-request -> wrap-in-specification -> client-serializer -> client-send-channel -> transport-provider -> server-deserializer -> server-receive-channel -> user-function -> server-serializer -> server-send-channel -> transport-provider -> client-deserializer -> unwrap-specification -> client-receive-channel -> user-response-handling.

The library would be responsible for wrapping requests and responses according to the specification, but not for the serialization mechanism nor the transport, since those could be very different depending on the problem.  Utilities could be provided for a "blocking call" and for cycling through responses on the receive-channel to find a specific response that has not yet been processed.

As an example, the above workflow should work seamlessly with the channels provided by zmq-async.  The send-channels provided would be used as-is, and each receive-channel would be wrapped with an additional channel that takes responses off and deserializes them before returning them to the user.  The transport provider, of course, would be ZeroMQ.

Are there any projects already doing this?  Would people be interested?  Any feedback?

-ToBeReplaced

Timothy Baldridge

unread,
Aug 1, 2013, 11:49:16 AM
to clo...@googlegroups.com
How would a blocking call be created? How would you implement this in a way that doesn't lock one machine to another machine's performance? For quite some time now, RPC has been considered a very bad idea, since it mixes reliable and unreliable semantics.

With a network you're never actually sure that a message will arrive (see the messenger problem). And yet one expects procedure calls to be reliable. So what happens when you make an RPC call and then the remote server dies? What are the semantics on the sender side?

Secondly, RPC kind of flies in the face of Clojure in general. Sending procedure calls? Why not send data?

I've been thinking about this a lot recently, and about how the entire process can be made lock-safe. That is, when a remote server dies, the system continues to respond as normal (see Joe Armstrong's thesis). I'm not completely convinced that the CSP style works well in this context; perhaps we should instead model these systems after unreliable message passing, as that's really all the network protocol promises you anyway.

Timothy Baldridge 


--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
“One of the main causes of the fall of the Roman Empire was that–lacking zero–they had no way to indicate successful termination of their C programs.”
(Robert Firth)

tobere...@gmail.com

unread,
Aug 1, 2013, 1:09:10 PM
to clo...@googlegroups.com
I'm thinking that a blocking call would be an ordinary call that then repeatedly reads from the receive-channel, checks whether the id matches, returns if it does, and puts the value back on the channel otherwise.  In a blocking call, you are always locked to the performance of the server.  The benefit here is that all you know about your server is that it is something sitting on the other side of your channel.  With ZeroMQ, it might be a router that delivers your request to the machine it thinks is most likely to be able to handle it.  The system serving your requests could be a single process making blocking calls to a third-party HTTP REST API, or it could be a network of load-balancing machines.
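
Roughly, the sketch I have in mind (nothing here is a settled API; I'm assuming the context map carries :send-ch and :receive-ch and that requests and responses carry an :id):

```clojure
(require '[clojure.core.async :as async])

(defn blocking-call
  "Sketch: send a wrapped request, then spin on the receive channel
  until a response with a matching :id appears. Responses meant for
  other callers are put back on the channel."
  [{:keys [send-ch receive-ch]} request]
  (async/>!! send-ch request)
  (loop []
    (let [response (async/<!! receive-ch)]
      (if (= (:id response) (:id request))
        response
        (do (async/>!! receive-ch response)  ; not ours; put it back
            (recur))))))
```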

There would be no intent to solve the messenger problem explicitly -- those semantics are up to the user.  By default, in the case of server death, the client side will just no longer receive responses on its receive-channel.  In the case of a blocking call, this means that the client side will hang.  It might be possible to include configurable semantics schemes with the library -- this isn't an area I'm well-versed in, but maybe someone else could help explore it?  Also, sometimes your transport provider may be able to offer guarantees like delivery assurance (as RabbitMQ does).  There would be nothing to prevent a transport provider from enforcing its own semantics.

In what ways are procedure calls different from data?  It feels like a procedure call is just a data contract.  This library would be doing the job of wrapping up two different queues -- one for outbound requests and one for inbound responses.  It's true that you'd much rather only have the outbound queue, but sometimes you need both.

-ToBeReplaced

Cedric Greevey

unread,
Aug 1, 2013, 3:36:47 PM
to clo...@googlegroups.com
On Thu, Aug 1, 2013 at 1:09 PM, <tobere...@gmail.com> wrote:
There would be no intent to solve the messenger problem explicitly -- those semantics are up to the user.  By default, in the case of server death, the client side will just no longer receive responses on its receive-channel.  In the case of a blocking call, this means that the client side will hang.

Ugh. At the *very* least it should eventually return with some kind of a timeout exception.

ToBeReplaced

unread,
Aug 1, 2013, 3:55:00 PM
to clo...@googlegroups.com
A client could use a timeout channel as its receive channel to get timeouts on a blocking call, though I don't think that's the right way to do it.  Alternatively, implementing a blocking call with an optional timeout wouldn't be difficult; it just can't be the default.

I think if you disallowed nil as a response, it would be easy to offer a variety of different blocking calls -- wait forever, wait 30 seconds, wait until (f) returns truthy, etc.
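
As a sketch, the timeout variant could be built on alts!! and a timeout channel (assuming the id-matching scheme discussed earlier, and that nil is disallowed as a response value):

```clojure
(require '[clojure.core.async :as async])

(defn blocking-call-with-timeout
  "Sketch: like a plain blocking call, but gives up after timeout-ms.
  Returns nil on timeout -- which is why nil must be disallowed as a
  legitimate response value."
  [{:keys [send-ch receive-ch]} request timeout-ms]
  (async/>!! send-ch request)
  (let [deadline (async/timeout timeout-ms)]
    (loop []
      (let [[response port] (async/alts!! [receive-ch deadline])]
        (cond
          (= port deadline)                nil        ; timed out
          (= (:id response) (:id request)) response   ; our answer
          :else (do (async/>!! receive-ch response)   ; someone else's
                    (recur)))))))
```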

-ToBeReplaced

Timothy Baldridge

unread,
Aug 2, 2013, 9:46:31 AM
to clo...@googlegroups.com
RPC ties a local process (and all the memory it's currently using) to a remote process. It glosses over the fact that the link between these two processes is quite unreliable. In his thesis, Joe Armstrong also points out that such a system is not only susceptible to hardware/network failure, it's also very susceptible to programming failures. A bug in your code could cause every 100th RPC call to fail, for example.

So instead of all of this, Erlang (or actually Erlang's OTP libraries) proposes a different view:

1) All messages are sent async and unreliably. This is the way networks are designed anyway: if you send a message to a remote system and it gets thrown into a queue, you really don't know what happens to that message in the end, except by asking the remote system again, and again, until you finally get an ACK.

2) If we follow the above model, then we can start programming in a fail-fast mode. This is what OTP is based on: supervisor processes that restart dead subprocesses. This is also why I'm not a fan of propagating error messages across channels. Instead, errors should kill go blocks, and those blocks should then be restarted by supervisors.
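
A minimal core.async rendering of that supervisor idea might look like this (purely a sketch; start-worker! is an assumed hook, not anything in core.async):

```clojure
(require '[clojure.core.async :as async])

(defn supervise
  "Restart-forever supervisor sketch. start-worker! must start a go
  block and return a channel that closes (or yields a value) when that
  worker dies."
  [start-worker!]
  (async/go-loop []
    (async/<! (start-worker!))  ; park until the worker is done or dead
    (recur)))                   ; fail fast: just start a fresh one
```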

So RPC assumes that every call will succeed. Message-passing systems assume that it's kind-of-sort-of-not-really-likely that your message will arrive at the remote system. It's a pessimistic view of the world, and with the systems I work with, it's the best approach.

So I guess this is a super long way of saying that I love the OTP style and would love to see it ported to core.async. But RPC is not the way, and blocking send/recv is not the way. To get reliable systems you need your system to always be capable of forward progress, and having a local process tightly coupled to a remote process will not get you there.

Timothy



Marc Hämmerle

unread,
Aug 2, 2013, 10:16:57 AM
to clo...@googlegroups.com
On Erlang: sometimes you *want* to block on a node and wait for the answer of a remote node. It's implemented as message passing under the hood -- the process gets de-scheduled, restarted in case of a crash if you want, and so on -- but the semantics are clearly sequential and blocking. Erlang obviously also has benefits in that area with OTP supervision and lightweight processes -- but doesn't core.async have at least the last one too?

Take a look into Basho's riak_core (https://github.com/basho/riak_core) and other distributed systems written in Erlang: heavy use of RPC calls.

So I'd say on RPC: clearly depends on your needs and is not automatically bad.

Marc

ToBeReplaced

unread,
Aug 2, 2013, 10:36:41 AM
to clo...@googlegroups.com
I'm uneducated here, since I've not used Erlang in any real capacity.  I thought that typically, if A sent a message to B that caused B to raise an exception, B would send an exception back to A and A would error out.  Thus, the client is blamed for the exception.  Is this correct?  In Clojure, I would have written this pattern by populating channels with exceptions:

server -> receive-channel -> deserialize-or-throw-if-exception-on-channel -> client-receive-channel.

The deserialize-or-throw-if-exception-on-channel step would run in a go block, which would error out before the client sees anything.  If a supervisor wants to restart it, that's its business.  In the event that every 100th RPC call receives one of these exceptions, eventually every client would die, or their supervisors would restart them.
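
As a rough sketch of that step (names are illustrative; I'm assuming the deserializer returns a Throwable when the wire payload was an exception):

```clojure
(require '[clojure.core.async :as async])

(defn deserialize-or-throw
  "Sketch of the deserialize-or-throw-if-exception-on-channel step.
  Copies messages from the raw receive channel to the client's channel,
  deserializing as it goes; an exception payload is rethrown, killing
  this go block and leaving restart policy to a supervisor."
  [raw-ch client-receive-ch deserialize]
  (async/go-loop []
    (when-some [msg (async/<! raw-ch)]
      (let [v (deserialize msg)]
        (if (instance? Throwable v)
          (throw v)  ; error out before the client sees anything
          (do (async/>! client-receive-ch v)
              (recur)))))))
```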

I'm wondering if maybe I have different expectations of RPC than Timothy.  I would like to view RPC as inherently unreliable.  "send" would have many different options -- send and pray, send and wait for ack, send and return control before you've finished serializing, etc.  "receive" could be async, could be blocking, could require a UUID it's looking for a response to, could have a timeout, could have a timeout function, etc.

What would OTP look like over channels?

-ToBeReplaced



James Ashley

unread,
Aug 2, 2013, 11:05:38 AM
to clo...@googlegroups.com






On Thu, Aug 1, 2013 at 1:09 PM, <tobere...@gmail.com> wrote:
... By default, in the case of server death, the client side will just no longer receive responses on its receive-channel. In the case of a blocking call, this means that the client side will hang.

Hung clients are bad.

Reliable request/reply gets hairy very quickly. The ZeroMQ guide devotes a long chapter to it.

The simplest (and least useful) approach it suggests is what it calls the Lazy Pirate pattern.

The nutshell version:
The server listens on a robust socket that can handle multiple requests from the same client (a Dealer, in 0mq terms). The client connects with its brain-dead REQ socket, sends the request, then periodically polls for a response. If it doesn't get a response before some timeout, it drops the socket and reconnects. After a certain number of retries, it gives up and notifies the user that it has lost the server connection.
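
As a rough sketch of that client loop in core.async terms (with a channel pair standing in for the REQ socket; reconnect! is an assumed hook that tears down and re-creates the underlying socket):

```clojure
(require '[clojure.core.async :as async])

(defn lazy-pirate-request
  "Sketch of the Lazy Pirate client loop over channels. Sends the
  request, waits up to timeout-ms for a reply, and on timeout calls
  reconnect! and retries. Returns the reply, or nil after exhausting
  the retries."
  [send-ch recv-ch reconnect! request {:keys [retries timeout-ms]}]
  (loop [attempt 1]
    (async/>!! send-ch request)
    (let [[reply _] (async/alts!! [recv-ch (async/timeout timeout-ms)])]
      (cond
        (some? reply)       reply
        (< attempt retries) (do (reconnect!)          ; drop and reopen
                                (recur (inc attempt)))
        :else               nil))))                   ; give up
```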

There's obviously a lot of black magic going on behind the scenes, but that's a big part of the point of using an MQ.

I don't have any idea how the other MQs handle this sort of thing.

Just to try to keep this on-topic:
I haven't had a chance yet to experiment with how well core.async cooperates with 0mq's basic strategy. It looks like there are a lot of really cool possibilities, and I think that some of the projects I'm seeing in this area are really exciting.

Regards,
James

David Pollak

unread,
Aug 2, 2013, 4:58:50 PM
to clojure
This is a tough and interesting issue.

Let's put aside the whole RPC issue for a moment and look at how code progressed from C-land to Java-land.

In C, the developer had to check return values from function calls to see if the function succeeded. That led to ignoring return values, or to testing with nested blocks of code:

if (succeeded(open_file(name, &file_struct))) {
  if (succeeded(do_something_else(...))) {
    /* do work */
  } else {
    /* close resources; release memory; return error */
  }
} else {
  /* close resources; release memory; return error */
}

Java improved on this with exceptions and garbage collection:

try {
  File f = open_file(name);
  try {
    Something s = do_something_else(...);
  } finally {
    close(f);
  }
} catch (...) {
  // handle inner and outer errors
}

Add ARM (automatic resource management) and you have something that looks a lot like a supervisor in Erlang (in my opinion)... basically you run all your code, and the exception handlers handle the exceptional situations.

So this works for both local calls and remote calls (putting aside the marshalling of parameters and state across address spaces), as long as the number of threads of execution can be handled by the runtime/operating system.

But on the JVM, we can no longer build systems that cause threads to block on the execution of off-process code, because we would need far more threads than the JVM can provide (the JVM is native-threaded, so it tops out at roughly 4K threads).

This is where core.async is particularly nice (it's double extra nice in JavaScript-land, where there's only one thread, so blocking is verboten). Rather than having nested callback functions (like our C-style nested if statements), core.async rewrites the code so that wherever there is a blocking call on a channel, the code becomes in effect a series of callbacks, but what the developer writes is linear.

Put another way, core.async gives us the syntax of a normal flow, but with the performance of releasing the thread until a result is available and then continuing the logical computation.

I think the idea of extending the ideas in core.async to accessing remote systems makes a lot of sense. Call it RPC. Call it something else... but it's the same concept... you give the developer linear-looking code that "does the right thing" with off-process calls. The right thing being: releasing the current JVM thread until the answer comes back, dealing correctly with timeouts, and correctly handling failures by releasing resources as well as invoking the appropriate exception-handling code (a supervisor?).

If we don't have a nice layer like this, we're stuck writing code like:

(let [t (timeout 1000)
      [v port] (alts! [(call-to-external-system ...) t])]
  (if (not= port t) (do-something-with v) (raise-error)))

where we are repeating the timeout boilerplate, and we probably also have to test v against nil (the channel is closed) and against exceptions.

Instead, if we have:

(rpc #(handle-error %) 1000 ;; default timeout
  (let [v (call-to-external-system ...)] (do-something-with v)))

and we do the same deep walking in the rpc macro that's done in the go macro, then we can identify all the remote calls and turn them into go-based code with the alts! and timeout boilerplate.

And then we've got a nice rpc system layered on top of core.async that has default timeouts... and if rpc supports nesting, we can tune the timeout for a given call.

So, I think the concepts and the code in core.async lend themselves directly to building an rpc-style system that handles the boilerplate of invoking off-process resources and getting results back from them, with correct timeouts, exception handling, and resource release.


Telegram, Simply Beautiful CMS https://telegr.am
Lift, the simply functional web framework http://liftweb.net
