Proposal for clustering algorithm

Ian Clarke

Aug 24, 2013, 11:22:07 AM
to swarm-...@googlegroups.com, James Earl Douglas
Ok, I'm going to take another whack at explaining my idea for a clustering algorithm, this time without mixing terminology.  Please disregard my previous attempt to explain this.

So we have data distributed across multiple computers, where the execution of a computer program always occurs on the computer where the data resides.  This means that if a computer program must read or write some data on computer A, and then subsequently read or write information on computer B, we need to stop the program, serialize the continuation, transmit it from computer A to B, and resume on computer B.  In other words, the execution "hops" from one computer to another.  This way, from the programmer's perspective, they are always working with local data.
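The hop model above can be simulated in miniature. The sketch below is a toy illustration, not Swarm's actual Scala continuation mechanism: a Python generator stands in for a serializable continuation, yielding a `(node, key)` request each time it needs data, while a scheduler counts a "hop" whenever that data lives somewhere other than where execution currently is. All names here are hypothetical.

```python
def run(program, data, start_node="A"):
    """Toy scheduler: `program` is a generator that yields (node, key)
    each time it needs a piece of data.  We count a 'hop' whenever the
    requested data lives on a different node than the one execution is
    currently on.  In real Swarm the continuation would be serialized
    and shipped over the network; here the generator stands in for it."""
    current, hops = start_node, 0
    gen = program()
    try:
        node, key = next(gen)
        while True:
            if node != current:
                hops += 1            # continuation "hops" to the data
                current = node
            # Resume the paused computation with the (now local) value.
            node, key = gen.send(data[current][key])
    except StopIteration as stop:
        return stop.value, hops

# A program touching data on two machines: it must hop A -> B -> A.
def example_program():
    x = yield ("A", "x")
    y = yield ("B", "y")
    z = yield ("A", "z")
    return x + y + z
```

With `data = {"A": {"x": 1, "z": 3}, "B": {"y": 2}}` and a start node of `"A"`, the first access is local and the next two each force a hop, so `run` returns the result along with a hop count of 2.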

The problem is that transmitting a serialized continuation is extremely expensive, so our goal is to arrange the data across these computers so as to minimize the number of times the continuation must hop.

Any algorithm that does this must begin by collecting data about how often a reference is dereferenced (whether locally or remotely).  We can use some kind of decaying running average here.  We might need to experiment a little to find the best approach.  So now, for every reference we have some metric that represents something like "dereferences per second" or DPS.
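One possible form for that decaying running average is an exponentially decayed event count, from which a rate falls out directly. This is just one candidate (the thread leaves the exact approach open to experimentation), and the class and parameter names are hypothetical:

```python
import math
import time

class DerefMeter:
    """Per-reference 'dereferences per second' (DPS) estimate via an
    exponentially decayed event count.  For a steady stream of events
    at rate r, the decayed count converges to r / mu where
    mu = ln(2) / half_life, so count * mu recovers the rate."""

    def __init__(self, half_life_s=60.0, clock=time.monotonic):
        self.half_life_s = half_life_s
        self.clock = clock        # injectable for testing
        self.count = 0.0          # exponentially decayed event count
        self.last = clock()

    def _decay(self):
        now = self.clock()
        self.count *= 0.5 ** ((now - self.last) / self.half_life_s)
        self.last = now

    def record_deref(self):
        self._decay()
        self.count += 1.0

    def dps(self):
        self._decay()
        return self.count * math.log(2) / self.half_life_s
```

The half-life controls the trade-off Ian alludes to: a short half-life reacts quickly to changing access patterns but is noisy; a long one is stable but slow to notice that a reference has gone cold.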

So, with this data we can now determine - for any given arrangement of data - the number of remote dereferences per second (RDPS).  Our goal is to move data around to minimize this number.
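Given per-reference DPS figures and an assignment of objects to computers, RDPS is just the DPS summed over references whose two endpoints live on different nodes. A minimal sketch, with hypothetical data representations (a placement map of object to node, and a DPS map keyed by reference endpoints):

```python
def rdps(placement, ref_dps):
    """Remote dereferences per second for a given arrangement of data.

    placement: dict mapping object id -> node id
    ref_dps:   dict mapping (source_obj, target_obj) -> DPS estimate
    A dereference is remote when source and target are on different nodes.
    """
    return sum(dps for (src, dst), dps in ref_dps.items()
               if placement[src] != placement[dst])
```

For example, with objects `a` and `b` on node A and `c` on node B, only the references that cross the A/B boundary contribute to the total.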

One wrinkle that I won't delve into at this stage is that no one computer will have a complete overview of where all data is located and all of the DPS.  Rather, I think the computers will need to negotiate in pairs, each pair trying to improve the overall network by swapping data.

So how do we figure out what data to swap?  I would suggest some kind of search algorithm, like best-first search with backtracking.

The idea is that a computer will identify a starting point: perhaps the remote reference, pointing at the computer it is negotiating with, that has the highest number of dereferences.

It then asks: what impact will it have on the RDPS if this data is moved from the remote node to me?  Let's say it goes from 1000 to 1200.  We then repeat this process (we're not actually moving the data; this is just the planning phase) under the assumption that this data was moved.  The approach is a bit like a chess engine gaming out future moves.  The details of exactly how we decide which move to try next will require further discussion and perhaps experimentation.

At a certain point we look at the best configuration we've found (the one with the lowest RDPS), and if it is lower than the current RDPS, we move the data.
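To make the planning phase concrete, here is a deliberately simplified variant: plain greedy hill-climbing rather than the best-first search with backtracking Ian proposes (greedy is just the easiest member of that family to sketch). It evaluates candidate single-object moves between the two negotiating nodes without actually moving anything, keeps the best configuration found, and stops when no move lowers RDPS. All names are hypothetical.

```python
def rdps(placement, ref_dps):
    # Remote dereferences per second: DPS summed over references
    # whose two endpoints live on different nodes.
    return sum(dps for (src, dst), dps in ref_dps.items()
               if placement[src] != placement[dst])

def plan_moves(placement, ref_dps, node_a, node_b, max_steps=10):
    """Greedy planning sketch for a pair of negotiating nodes: at each
    step, try moving each object between node_a and node_b, keep the
    single move that lowers RDPS the most, and stop when nothing
    improves.  (No data actually moves; this is only the planning
    phase.)  Returns (best_placement, best_rdps)."""
    best = dict(placement)
    best_cost = rdps(best, ref_dps)
    for _ in range(max_steps):
        candidate, cand_cost = None, best_cost
        for obj, node in best.items():
            if node not in (node_a, node_b):
                continue              # only this pair may swap data
            trial = dict(best)
            trial[obj] = node_b if node == node_a else node_a
            cost = rdps(trial, ref_dps)
            if cost < cand_cost:
                candidate, cand_cost = trial, cost
        if candidate is None:
            break                     # no single move lowers RDPS
        best, best_cost = candidate, cand_cost
    return best, best_cost
```

Only if the best configuration found beats the current RDPS would the nodes actually carry out the moves, as described above. A real implementation would want the backtracking Ian mentions, since greedy search can get stuck when only a multi-object move pays off.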

This may be quite CPU intensive, but there are various efficiencies we can introduce.  For example, if there is a group of data that is tightly inter-connected, but sparsely connected to data outside the group, then we can treat it as a single piece of data for the purposes of the planning algorithm.

Does that make sense?  Clearly this is quite half-baked right now, but with further discussion I think we can fill in the details.

Ian.

--
Ian Clarke
CEO, Uprizer Labs LLC
Email: i...@uprizer.com
Ph: +1 512 422 3588

Michael

Aug 24, 2013, 6:38:50 PM
to swarm-...@googlegroups.com, James Earl Douglas, i...@uprizer.com
Right, I think this general strategy is a good place to start from. We can measure RDPS as a distance, make the comparison you're suggesting, and make whatever swaps are appropriate.  In my mind, finding a good mapping from real-world parameters to the space in which to make these comparisons is probably not a simple problem. Whether this space is defined relative to a specific objective, the data, bandwidth, or some combination of the above is an open question.

We may need to do some simulation.

The simulation may not be in Scala, but I can share code and findings as they are produced. If this makes sense, I can make this my current task. I'm not yet 100% familiar with the code, but I believe I've got the general idea and can probably start on some work in this area in the next 24 hours.

Ian Clarke

Aug 25, 2013, 11:08:17 AM
to swarm-...@googlegroups.com, James Earl Douglas
Hey Michael,

You're welcome to work in whichever language you are most comfortable in, however if you do want to work in Scala I have just committed the beginnings of a simulation framework here: https://github.com/sanity/Swarm/tree/master/swarm-experiments/src/main/scala/swarm/experiments/clustering

I'm afraid my Scala is very rusty (much wailing and gnashing of teeth were involved just in this short piece of code), so if anyone wants to clean it up I will not take offense.  I must confess that Scala is a PITA after over a year of working mainly in Kotlin, but perhaps that will change as I grow more familiar with it again.

The advantage of working in Scala is that if we have a common interface to both simulated code and the actual Swarm platform then we can write our experimental clustering algorithms, simulate them, and then use the exact same code in real life.

Ian.



Michael

Aug 25, 2013, 3:28:40 PM
to swarm-...@googlegroups.com, James Earl Douglas, i...@uprizer.com
Point taken - it'd be much easier if we could just use the code we simulated with. Scala it is. I'll share a link when I've started.

Michael

Aug 29, 2013, 9:15:49 AM
to swarm-...@googlegroups.com, James Earl Douglas, i...@uprizer.com
Update: Still working on this. Just been busy and refreshing on Scala. Trying to have something to look at by Monday.

Ian Clarke

Aug 29, 2013, 1:56:52 PM
to swarm-...@googlegroups.com, James Earl Douglas
That's great Michael.

So you are comfortable with my proposal above?  If anything isn't clear please let me know.

Ian.

Michael

Aug 30, 2013, 10:24:54 AM
to swarm-...@googlegroups.com, James Earl Douglas, i...@uprizer.com
On the surface what you described looks good. I may run into design questions while working with the current Swarm code, but the idea of what you're saying makes sense to me. As you said, I'll let you know if/when I have those questions.

Mike

Seth Johnson

Aug 24, 2013, 7:11:43 PM
to swarm-...@googlegroups.com, James Earl Douglas
Don't pay too much attention to me, I'm a dilettante here.  So take this as something that you might think about down the pike . . .

How does this mobile code know the data structure on all the nodes it might go to?  Is that just the programmer's job, per application?  That is, does the app distribute data in a structure designed by the programmer, or does the code divine the structure, or work with whatever it finds, once it goes to the nodes?  Is the intent that swarm would support either approach -- predefined structure or free-form data?

The dilettante wonders whether a uniform data structure would allow the search to be optimized.  Maybe you already have that . . .

Everything looks like a nail to me, and I often think my esoteric concept of a universal/uniform data structure might be interesting in lots of contexts.  If I understood the approach to arbitrary data structure requirements, I might have useful suggestions.  But again, don't take me too seriously!  :-)

Seth Johnson


Ian Clarke

Aug 30, 2013, 11:30:42 AM
to swarm-...@googlegroups.com, James Earl Douglas
Sorry for the slow response, for some reason Google Groups is forcing me to manually approve emails :-/

On Sat, Aug 24, 2013 at 6:11 PM, Seth Johnson <seth.p....@gmail.com> wrote:
> How does this mobile code know the data structure on all the nodes it might go to?  Is that just the programmer's job, per application?

In Swarm, objects have a reference, similar to a pointer in C, except that this reference also records which physical computer the object resides on.  If you try to dereference an object that is on a remote computer, Swarm will transparently migrate the computation to that computer.  From the programmer's perspective it appears as if all data is local, even though in reality it is distributed across multiple machines.  Does that make sense?
 
> So the app distributes data in a structure designed by the programmer, or the code divines the structure or works with whatever it is, once it goes to the nodes?

The programmer decides on the structure of the data (or relies on a pre-provided collection type like a list, map, or set), just as with writing a normal computer program.  Swarm decides on the physical location of the data, which it determines by observing how the data is accessed.
 
> As in, what's intended is that swarm would support either approach -- predefined structure or process free form data?
>
> The dilettante wonders whether a uniform data structure would allow the search to be optimized.  Maybe you already have that . . .

Swarm supports an arbitrary graph of data, just as with writing a computer program normally.  The only difference is that this graph might be spread across several physical machines, but Swarm tries to make this transparent, or as transparent as possible, to the programmer.
 
> Everything looks like a nail to me, and I often think my esoteric concept of a universal/uniform data structure might be interesting in lots of contexts.  If I understood the approach to arbitrary data structure requirements, I might have useful suggestions.  But again, don't take me too seriously!  :-)

If you have an idea for a cool new data structure, perhaps one that supports database-like querying, or a tuple space, then there is nothing whatsoever to prevent you from implementing it on top of Swarm, but Swarm itself doesn't commit the programmer to any particular type of data structure.

Does that make sense?

Ian. 
