Ok, I'm going to take another whack at explaining my idea for a clustering algorithm, this time without mixing terminology. Please disregard my previous attempt to explain this.
So we have data distributed across multiple computers, and execution always occurs on the computer where the data resides. This means that if a program must read or write some data on computer A, and then subsequently read or write data on computer B, we need to stop the program, serialize its continuation, transmit it from computer A to computer B, and resume it there. In other words, the execution "hops" from one computer to another. From the programmer's perspective, they are always working with local data.
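To make the hop concrete, here is a toy sketch in Python. Everything in it (Continuation, Node, hop) is illustrative, not a real runtime; a real system would capture the live call stack rather than an explicit state dict, and would send the bytes over the network rather than calling the peer directly.

```python
import pickle

class Continuation:
    """Toy stand-in for a captured continuation; a real runtime would
    capture the suspended call stack, not an explicit state dict."""
    def __init__(self, resume_at, state):
        self.resume_at = resume_at  # label for where execution resumes
        self.state = state          # local variables carried across the hop

class Node:
    def __init__(self, name, local_data):
        self.name = name
        self.local_data = local_data

    def receive(self, payload):
        k = pickle.loads(payload)
        # ...resume at k.resume_at here, now with local access to self.local_data...
        print(f"resumed at {k.resume_at!r} on {self.name} with {k.state}")

def hop(continuation, target):
    """Serialize the continuation and ship it to the node owning the data."""
    payload = pickle.dumps(continuation)  # the expensive step discussed below
    target.receive(payload)               # stands in for the network transfer

# A program on A has read x locally, and now needs y, which lives on B.
node_b = Node("B", {"y": 2})
hop(Continuation("after-read-x", {"x": 1}), node_b)
```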
The problem is that serializing and transmitting a continuation is extremely expensive, so our goal is to arrange the data across these multiple computers so as to minimize the number of times a continuation must hop.
Any algorithm that does this must begin by collecting data about how often each reference is dereferenced (whether locally or remotely). We can use some kind of decaying running average here; we might need to experiment a little to find the best approach. So now, for every reference, we have some metric that represents something like "dereferences per second", or DPS.
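For example, the decaying average could be an exponentially weighted rate estimate, updated on every dereference. The half-life below is an arbitrary placeholder; tuning it is exactly the experimentation mentioned above.

```python
import time

class DereferenceRate:
    """Exponentially decaying estimate of dereferences per second (DPS)."""
    HALF_LIFE = 60.0  # seconds; a guess, needs tuning

    def __init__(self):
        self.rate = 0.0
        self.last_update = time.monotonic()

    def record_dereference(self):
        now = time.monotonic()
        elapsed = max(now - self.last_update, 1e-9)
        decay = 0.5 ** (elapsed / self.HALF_LIFE)
        # Decay the old estimate, then blend in the instantaneous rate
        # implied by this inter-arrival gap (1/elapsed).
        self.rate = self.rate * decay + (1.0 - decay) / elapsed
        self.last_update = now

    def value(self):
        """Current DPS estimate, decayed to now."""
        elapsed = time.monotonic() - self.last_update
        return self.rate * 0.5 ** (elapsed / self.HALF_LIFE)
```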
So, with this data we can now determine, for any given arrangement of data, the number of remote dereferences per second (RDPS): the sum of the DPS of every reference whose source and target sit on different computers. Our goal is to move data around to minimize this number.
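Concretely, given a table of where each object lives and the DPS of each reference (the data structures here are purely illustrative), RDPS is a simple sum over the cross-machine references:

```python
def rdps(placement, references):
    """Total remote dereferences per second for a given arrangement.

    placement:  dict mapping object id -> node id
    references: iterable of (source_obj, target_obj, dps) tuples
    """
    return sum(
        dps
        for source, target, dps in references
        if placement[source] != placement[target]
    )

# Example: one cross-node reference at 40 DPS, one local one at 100 DPS.
placement = {"a": "node1", "b": "node2", "c": "node2"}
references = [("a", "b", 40.0), ("b", "c", 100.0)]
assert rdps(placement, references) == 40.0
```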
One wrinkle that I won't delve into at this stage is that no single computer will have a complete view of where all the data is located, or of all the DPS figures. Rather, I think the computers will need to negotiate in pairs, each pair trying to improve the overall network by swapping data.
The idea is that a computer will identify a starting point, perhaps the remote reference to the computer it is negotiating with that has the highest DPS.
It then asks: what impact would it have on the RDPS if this data were moved from the remote node to me? Let's say it goes from 1000 to 1200. We then repeat this process (we're not actually moving the data; this is just the planning phase) under the assumption that the data was moved. The approach is a bit like a chess engine gaming out future moves: a move that looks bad in isolation, like the jump to 1200 above, may still open up better configurations later. The details of exactly how we decide which move to try next will require further discussion and perhaps experimentation.
At a certain point we look at the best configuration we've found (the one with the lowest RDPS), and if it is lower than the current RDPS, we move the data.
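Here is a minimal sketch of that planning loop, reusing the rdps helper and example data from above. The fixed depth and the brute-force "try every single-object move between the two negotiating nodes" strategy are placeholders for the open "which move next" question.

```python
def plan_moves(placement, references, nodes, depth=3):
    """Game out up to `depth` single-object moves between `nodes`
    (the two negotiating computers) and return the best plan found.

    Returns (best_rdps, move_list); move_list is empty if no plan
    beats the current arrangement.
    """
    best = (rdps(placement, references), [])

    def explore(current, moves):
        nonlocal best
        score = rdps(current, references)
        if score < best[0]:
            best = (score, list(moves))
        if len(moves) == depth:
            return
        # Candidate moves: relocate each object to the other node.
        for obj, node in current.items():
            if node not in nodes:
                continue
            other = nodes[0] if node == nodes[1] else nodes[1]
            current[obj] = other
            moves.append((obj, other))
            explore(current, moves)
            moves.pop()
            current[obj] = node  # undo: planning only, no real move yet

    explore(dict(placement), [])
    return best

best_rdps, moves = plan_moves(placement, references, ("node1", "node2"))
if moves:  # only commit if the plan beats the status quo
    print(f"apply {moves}, lowering RDPS to {best_rdps}")
```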
This may be quite CPU-intensive; however, there are various efficiencies we can introduce. For example, if there is a group of data that is tightly inter-connected but only sparsely connected to data outside the group, we can treat the whole group as a single piece of data for the purposes of the planning algorithm, as sketched below.
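One way to find such groups is to fuse objects joined by unusually hot references, which is the same coarsening idea multilevel graph partitioners use. A rough sketch, where the "at least 10x the average DPS" merge rule is an arbitrary placeholder:

```python
class DisjointSet:
    """Union-find, used here to fuse objects into super-objects."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def coarsen(references, hot_factor=10.0):
    """Group objects joined by references much hotter than average.

    The merge rule (DPS >= hot_factor * mean DPS) is a placeholder;
    picking a real rule needs experimentation. Returns a mapping
    object -> group id.
    """
    if not references:
        return {}
    mean_dps = sum(dps for _, _, dps in references) / len(references)
    groups = DisjointSet()
    for source, target, dps in references:
        groups.find(source)  # make sure both endpoints are registered
        groups.find(target)
        if dps >= hot_factor * mean_dps:
            groups.union(source, target)
    return {obj: groups.find(obj) for obj in list(groups.parent)}
```

The planner would then treat each group id as a single movable object, which shrinks the search space considerably.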
Does that make sense? Clearly this is quite half-baked right now, but with further discussion I think we can fill in the details.
Ian.