Alexander,
I really don't see how this solves the problem at hand. Let's boil it down to the simplest example.
I have code running in two nodes, node1 and node2. There are only three ways that processes in node1 can communicate with processes in node2:
1. sending a message to a ProcessId
2. sending a message to a SendPort
3. sending a message to a named process with `nsend :: String -> NodeId -> Message -> Process ()`
I have made some bug fixes to my code, and I want to deploy them. I shut down node2 and upgrade, then bring it back online. All instances of ProcessId and SendPort in node1 are now invalid. How exactly does Paxos help me solve this problem? It seems to me that you're suggesting I should use a different, higher level protocol to solve the problem of inter node communication. Is that right?
I'm not talking about finding a specific bit of data in a cluster of nodes. The ProcessId/SendPort is the public interface processes use to send data to one another. If this interface is being used, and we want to change the code that is running on one of the nodes, how do we continue to address the process from the other nodes? Are you suggesting that quorum algorithms are the solution to this? I can't see how. Group services, yes, and these can of course be built on consensus protocols, though there are other ways too.
It seems to me that we're talking about introducing architectural complexity that belongs in the application space - i.e., is appropriate for some applications, but not all - but not solving the fundamental problem. Using a consensus/quorum algorithm seems like overkill: why not just communicate over HTTP and use a load balancer, then take the backend nodes offline one at a time for upgrade? Both amount to the same answer: use a different protocol for inter-node communication. That's not what we're trying to achieve and it's a pretty broad stroke.
We *do* have open bugs for looking at distributed algorithms including...
1. group services (process groups, global naming, routing, cluster management, etc)
2. atomic broadcast (think zab, rabbitmq's gm and similar)
3. distributed transactions/locking (we already have Jeff's distributed-process-global, we'd like a Paxos-esque set of consensus protocols too)
I'm not looking at any of these at the moment, as there are many far more mundane but equally important tickets that require my attention. If you feel like adding one or more of them to distributed-process-platform though, we'll *very* gladly take a pull request! ;)
> I think this is a better foundation than even the "linked process" model in Erlang, because the "linked process" model still fails in many scenarios, and distributed consensus solves a much larger set of problems.
These are different tools for different problems, operating at a completely different architectural level of detail. Paxos fails in a number of scenarios too, which is why we are interested in leaderless protocols as well as quorum protocols that depend on distributed leader election (which can deadlock).
> Imagine the following problems:
> 1. Restarting a process.
[snip]
> While the process restarts, clients should not in general go to a backup node, but the queries that normally would go to node 4 should instead be evenly distributed among the other 3.
You're assuming a homogeneous cluster. That's fine - and in general, distributed systems *are* much easier to design if 'all nodes are equal' as it were - but it's an assumption that a low level framework cannot make. Besides, even in a homogeneous cluster, what happens when nodes 1-3 have references to ProcessId/SendPort for node 4 which has been restarted!? You've not solved that problem at all, because you're just looking at it from the 'external client' point of view. On the other hand 'internal clients' (i.e., the other nodes) still need to communicate with one another, and can no longer do so using ProcessId/SendPort if the node has been restarted. Using `nsend` *does* alleviate this ill to some extent, but it's not a panacea.
> All of the above problems can be solved by getting consensus on 1) How things are sharded 2) Where things are located 3) What things are live.
>
> Thus I think the basic service that is needed is distributed consensus.
[snip]
> On top of distributed consensus it is possible to build stable names, and efficient request routing where all clients and all servers have the same view of how the service looks like and how it should be interacted with.
That still doesn't solve the problem outlined above; Unless all nodes interact with one another using the consensus protocol and never use ProcessId/SendPort to communicate with one another!
> Whether a process is live or dead can be solved by consensus, and it doesn't have split-brain issues, and can deal with the failure of multiple processes. Stable name-lookup, that is service name to set of IPs, and a representation for how queries should be routed can be dealt with by distributed consensus.
Right. So this *is* a useful thing to have if you're happy to program the cluster at this level and not to use low level primitives like ProcessId/SendPort. That's fine and as I say, we'd be delighted to offer an API at this level of abstraction on top of the existing infrastructure.
> State that need to be consistent could be sharded with replicas, but with a stable access pattern (first go to primary, then secondary, etc). Then there would be mostly-consistent data that would be spread around in caches etc.
Right. But again, you're assuming that the ProcessId/SendPort is not part of that data and that all inter-node communication is done via group services. Let me tell you now, that no matter how tight your implementation - and e.g., `fast-paxos' is *very* quick - there is significant overhead in making all internode communication go this route compared to sending direct to the target process. Sometimes you actually want to just send a message to a process, without going around the houses.
> TL;DR - Process migration is not a relevant paradigm anymore.
I didn't realise it was considered a paradigm. Erlang doesn't *do* process migration in order to change the code at runtime. You just give the code server the new compiled modules at runtime and the recursive tail call for the servers enters the new version of `loop(State) -> ....' and it's job done. Well, that's the *short* version - it's a bit more involved in practise of course. It's also a goo bit more involved doing this in Haskell, but we're exploring whether or not such a facility might be useful and how we might implement it. The idea of using 'process migration' was one of several suggestions about how to solve the problem that you might want to replace version 1 of the *code* with version 2 without suffering downtime and in a manner transparent to the other nodes which are sending you messages.
Like I said, I'm not going to be working on anything like this for *ages* as there are far too many other things to do (i.e., optimisations to message passing, tracing, supervision trees, etc) so if you know what you want this to look like and are willing to contribute, that would be fantastic!
Cheers,
Tim
On 10 Feb 2013, at 19:46, Alexander Kjeldaas wrote:
> TL;DR - Process migration is not a relevant paradigm anymore.
>
> Process migration itself is a slightly narrow focus.
>
> Imagine a "large" set of data, 100 records, served by 4 copies of a process where each process holds 50 records, so that each record is duplicated in two places.
>
> Imagine the following problems:
> 1. Restarting a process.
> 2. Resizing this set of servers in response to increased load (to 5 or 6 servers).
> 3. Optimal routing of a single request to a given server.
> a) In the case where it is a read-only query
> b) In the case where it is a mutation of a record.
>
> Doing 1 efficiently either involves using the 3 other processes to prime the new process, or to have distributed stable storage (but that might be no faster than shipping the in-memory data from the 3 other processes).
>
> While the process restarts, clients should not in general go to a backup node, but the queries that normally would go to node 4 should instead be evenly distributed among the other 3. In general, the routing of requests from N clients to a service implemented by M machines would change.
>
> Doing 2 efficiently involves doing an optimal shuffle of the data on all servers, so that typically 10 records from each of the 4 servers are used to prime a new server. Additionally, traffic might need to be duplicated to the new server for a while to prime caches.
>
> Doing 3 efficiently means having the IP of the server that should be contacted be a function of the request itself. For example if the records are sharded based on a unique key that is part of the query, then the high-level sharding information should be known to the client doing the request, or a proxy that sits with this knowledge should be used. In case 3a) any of the two servers that hold the required data could be queried based on a round-robin scheme, a least-loaded scheme or some other application specific load balancing scheme. In case 3b), a more rigorous scheme might be needed, such as always writing to the "primary server" among the two, and only writing to the backup once consensus that the primary server is down is achieved.
>
> All of the above problems can be solved by getting consensus on 1) How things are sharded 2) Where things are located 3) What things are live.
>
> Thus I think the basic service that is needed is distributed consensus. I think this is a service that logically (although not implementation wise) sits right above the transport layer.
>
> On top of distributed consensus it is possible to build stable names, and efficient request routing where all clients and all servers have the same view of how the service looks like and how it should be interacted with.
>
>