Good idea bringing in the list; we should probably keep these conversations around for posterity, for whenever more volunteers come along.
I think that assumption is spot on. My point assumed that the data is equally available on a set of machines, and that you have a frontend application (say, a web server) that wants to delegate work to that set of machines. From the application server's perspective, I should be able to write code that is location-agnostic when the location of the data is irrelevant but processing power matters, as well as code that will find the data when location does matter and processing power is less important. I hope I'm making sense, but I think you get the gist of what I'm saying.
This solves the problem of location-aware data processing really well, but it obviously doesn't fit well with the idea of location-agnostic data processing. I think this is where my suggestion introduces the complexity.
So if I understand correctly, the supervisor algorithm (as I'll call it for now) would run at a per-group level (which translates to per-machine, or per-datacenter?). It would monitor how many times a dereference results in a move from its group to another (and vice versa), and at a certain threshold it would communicate with the other group's supervisor to arrange moving the data from that group to its own, in order to reduce the number of moves. Is that correct?
One thing I'm wondering is how we track the moves. Increment a counter? Send a message to the supervisor, which maintains its own internal counter? Also, if we plan on abstracting the level at which the algorithm runs (per-machine vs. per-datacenter), how do we coordinate the move of data so that the receiving end is actually capable of storing that data?
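To make the counter idea a bit more concrete, here is a rough Python sketch of what I have in mind. It is only an illustration under my own assumptions: the `Supervisor` class, `record_move`, and the threshold value are names I made up for this email, not anything that exists in the code today. Each supervisor counts moves per (data key, remote group) pair, and once a pair crosses the threshold it produces a relocation request and resets that counter.

```python
from collections import Counter


class Supervisor:
    """Hypothetical per-group supervisor that counts cross-group moves."""

    def __init__(self, group, threshold):
        self.group = group
        self.threshold = threshold
        # (data_key, remote_group) -> moves caused by dereferencing data_key
        self.moves = Counter()

    def record_move(self, data_key, remote_group):
        """Record one move from this group to remote_group.

        Returns a relocation request (data_key, from_group, to_group) once
        the threshold is reached, otherwise None.
        """
        pair = (data_key, remote_group)
        self.moves[pair] += 1
        if self.moves[pair] >= self.threshold:
            # Start counting again after asking for the data to be moved.
            del self.moves[pair]
            return (data_key, remote_group, self.group)
        return None


supervisor_a = Supervisor(group="A", threshold=3)
requests = [supervisor_a.record_move("user:42", "B") for _ in range(3)]
print(requests[-1])
```

Whether the count is bumped in place or arrives as a message to the supervisor is still open; in this sketch `record_move` could be the handler for either.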
It seems to me that we would have to write the supervisor to run at the per-machine level, so that it is aware of what resources it has free (possibly not only for handling moves of data, but also for passing along a computation if local CPU/memory resources are too low?). Supervisors would also need some concept of a higher-level grouping. For example: supervisor A knows that it is in the same group as supervisor B, and since supervisor C often requests data from A and B's group, they coordinate to send the data in question to C's group, evenly distributed over the machines in C's group.
Hopefully I'm making sense (just tell me if I'm not, I can take it :) ).
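As a rough illustration of the "evenly distributed, but only where it fits" part, here is a small Python sketch. Everything in it is an assumption on my part: `plan_distribution`, the idea that each machine's supervisor reports a single "free units" number, and the greedy most-free-machine-first placement are just one simple way it could work, not a proposal for the final design.

```python
def plan_distribution(items, free_capacity):
    """Plan where each data item goes within the receiving group.

    items: dict mapping data key -> size (arbitrary units)
    free_capacity: dict mapping machine name -> free units, as reported by
        each machine's (hypothetical) supervisor
    Returns a dict mapping data key -> machine name.
    Raises ValueError if an item does not fit on any machine.
    """
    remaining = dict(free_capacity)  # copy so the caller's data is untouched
    placement = {}
    # Place the largest items first, each on the machine with the most room.
    for key, size in sorted(items.items(), key=lambda kv: -kv[1]):
        machine = max(remaining, key=remaining.get)
        if remaining[machine] < size:
            raise ValueError(f"no machine can store {key!r} ({size} units)")
        remaining[machine] -= size
        placement[key] = machine
    return placement


plan = plan_distribution({"a": 4, "b": 3, "c": 2}, {"c1": 5, "c2": 5})
print(plan)
```

The point of raising an error is that the sending group finds out before any data moves that the receiving group cannot hold it, which is the coordination question I was asking about above.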
Paul