Where's the data?

Patrick Wright

Oct 23, 2009, 1:06:20 PM

to swarm-...@googlegroups.com

Hi

I'm wondering if anyone has put thought into how any given node in a
Swarm finds out where a given data item or data set is located at the
moment?

I think at a minimum there are two problems to address:
- how, within a continuation, we identify which data we want to access
- how to have an up-to-date view on each node of where data is located

As far as identification, the simplest thing I can think of to start
with is that we identify data by a class name (type) and a single key.
The key's value must be unique to that type across the cluster, and is
opaque (or blind).
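
A minimal sketch of that identification scheme, assuming a type name plus an opaque string key. The names `DataId`, `local_store` and `lookup` are hypothetical and not part of Swarm:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataId:
    """Hypothetical identifier: a type name plus an opaque key."""
    type_name: str
    key: str  # opaque: callers compare keys but never parse them


# Hypothetical per-node store mapping identifiers to locally held values.
local_store = {DataId("Account", "a1f3"): {"balance": 10}}


def lookup(data_id):
    """Return the locally held value, or None if this node does not hold it."""
    return local_store.get(data_id)
```

Because the dataclass is frozen, it is hashable and can be used directly as a dictionary key. The same key under a different type name is a different identifier, which matches the "unique to that type" rule.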

There are a bunch of ways to propagate the knowledge of where the data
is. For example, nodes could handshake at startup to exchange their
key and type lists, then broadcast whenever data moves. But I haven't
worked out anything concrete yet.
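
One way the handshake-plus-broadcast idea could look, as a rough in-process sketch. The `Node` class and its methods are made up for illustration, and direct method calls stand in for network messages:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.held = set()     # (type_name, key) pairs stored on this node
        self.locations = {}   # (type_name, key) -> name of the node believed to hold it
        self.peers = []       # nodes this node has shaken hands with

    def add(self, data_id):
        """Create a data item locally."""
        self.held.add(data_id)
        self.locations[data_id] = self.name

    def handshake(self, other):
        """Exchange key lists in both directions when two nodes meet."""
        for a, b in ((self, other), (other, self)):
            a.peers.append(b)
            for data_id in b.held:
                a.locations[data_id] = b.name

    def move(self, data_id, target):
        """Hand an item to target, then tell this node and all its peers."""
        self.held.remove(data_id)
        target.held.add(data_id)
        for node in [self] + self.peers:
            node.locations[data_id] = target.name
```

This assumes every node has shaken hands with every other node, so a broadcast from the old owner reaches the whole cluster.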

Thoughts?
Patrick

Rick R

Oct 23, 2009, 5:59:46 PM

to swarm-...@googlegroups.com

This might be getting too peer2peer-ish, but we could take a page from the ad-hoc routing book. If a node needs to access a piece of data, it sends a SASE[1] to its neighbors, who forward it on to the owner if they know who that is; if they don't, they forward it on to their neighbors. When the owner is found, it sends the appropriate data item to the originator of the request.

I use the term neighbors very loosely, but something tells me that there is a way to cut down on messages if we don't broadcast every request for data.

In addition, I like the reactive (need to know) model for knowing where data resides instead of broadcasting every change.

The movement of data from one node to another is addressed by the fact that the previous owner of the data should know exactly where the new owner is.
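
A small sketch of that last point, assuming each previous owner keeps a forwarding pointer to the node it handed the data to. `Node`, `give` and `find_owner` are hypothetical names, and the hop limit is an arbitrary safety bound:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}       # key -> value for items this node currently owns
        self.moved_to = {}   # key -> Node that the item was handed to

    def give(self, key, new_owner):
        """Transfer ownership and remember where the item went."""
        new_owner.data[key] = self.data.pop(key)
        new_owner.moved_to.pop(key, None)  # drop any stale pointer at the new owner
        self.moved_to[key] = new_owner


def find_owner(start, key, max_hops=16):
    """Follow forwarding pointers from start; return the owner or None."""
    node = start
    for _ in range(max_hops):
        if key in node.data:
            return node
        node = node.moved_to.get(key)
        if node is None:
            return None
    return None
```

Nothing is broadcast when data moves; a lookup only pays for the hops it actually needs, which is the reactive model described above.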

[1] Self Addressed Stamped Envelope

Peter Volk

Oct 23, 2009, 7:15:25 PM

to swarm-...@googlegroups.com

With one small difference: Swarm should ship the function, not the data
;). Also, I think there is a fundamental difference in the type of
data you are looking at. In peer2peer you are looking for one piece of
data, whereas in Swarm you will have lists, arrays etc. Then you run
into a termination problem. For lists you will have to ask the
complete network before you get the complete list, since every node
can create, or take ownership of, new data. For example, you may
create data on one node and then add it to a list that resides on a
different server. I think the problem of finding where the data is
differs per data type. E.g. for a simple numerical value there should
be exactly one in the network. But if you look at lists, sets or
arrays, then you start having a few "challenges".
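
To illustrate the termination problem for lists, here is a toy sketch in which each node is just a dict from a list id to the elements it holds locally. This representation and the name `collect_list` are assumptions for illustration only:

```python
def collect_list(list_id, nodes):
    """Gather every element of a distributed list by asking all nodes."""
    elements = []
    for node in nodes:
        # No early exit is possible: any node may hold further elements.
        elements.extend(node.get(list_id, []))
    return elements
```

For example, `collect_list("todo", [{"todo": [1, 2]}, {}, {"todo": [3]}])` has to visit all three nodes, including the empty one, before it can be sure the list is complete.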

There is quite a bit of related work on those topics in the field of
distributed object management and distributed database systems. I know
that there is some work on distributed reference management, too.
I would need to dust off my old files for that. I have seen a bit of
work on it but would not call myself an expert.

So my proposal would be to think about the "where is the data?"
problem as "how do I manage distributed objects with references across
servers". Once you know how you will manage the references (since in
Java everything is simply an object aka reference) you will know how
to find your data.

Cheers,
Peter

Rick R

Oct 23, 2009, 8:18:11 PM

to swarm-...@googlegroups.com

My intuition is that shipping the function won't always be the optimal approach, nor will it always be the most intuitive. I can't prove that, however ;)

Ian Clarke

Oct 23, 2009, 8:51:32 PM

to swarm-...@googlegroups.com

On Fri, Oct 23, 2009 at 4:59 PM, Rick R <rick.ri...@gmail.com> wrote:
> This might be getting too peer2peer-ish, but we could take a page from the
> ad-hoc routing book. If a node needs to access a piece of data, he sends a
> SASE[1] to his neighbors who then forward it on to the owner, if they don't
> know, then they forward it on to their neighbors. When the node is found,
> its send the appropriate data item to the originator of the request.

Well, as Peter points out, it would be the continuation that gets
shipped to the data; the data doesn't move.

But apart from that, I like your idea. Here is a bit more depth on
how it could be implemented:

Every piece of data in Swarm has a UID, perhaps a 128-bit number
randomly chosen on object creation (meaning that the probability of
any two given UIDs colliding is 1/(2^128), and roughly 2^64 objects
would be needed before a collision becomes likely - i.e. very low).
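
The collision risk for randomly chosen UIDs can be estimated with the standard birthday approximation. The helper below is an illustrative sketch (the name `collision_probability` is made up), not a statement about how Swarm generates ids:

```python
import math


def collision_probability(n_objects, bits=128):
    """Birthday approximation: P(any collision) ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    pairs = n_objects * (n_objects - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs / 2 ** bits)
```

With a billion objects the estimate is on the order of 1e-21. It only climbs to roughly 0.39 once about 2^64 objects exist, which is where the 2^64 figure comes from for a 128-bit id.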

References to data contain the UID, and optionally the last known
location of the data.

If you need to ship a continuation, you can then use the last known
location. If it's not there, then that peer tries to get it to its
destination.

If you have no last known location, or perhaps if you've already tried
to route this continuation and it has found its way back to you, then
you do a broadcast to all nodes to find out where it is. The response
should also be a broadcast, and all nodes that receive it should
update their last known location based on it.

The idea is that these broadcasts should be rare.
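
The routing steps above could be sketched roughly as follows. Everything here is an assumption for illustration: the `Node` class, the shared `cluster` list standing in for the network, and a plain function standing in for a Swarm continuation:

```python
class Node:
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster   # shared list of every node, standing in for the network
        self.data = {}           # uid -> value held locally
        self.last_known = {}     # uid -> Node believed to hold the data
        cluster.append(self)

    def broadcast_find(self, uid):
        """Ask every node who holds uid; the answer updates every node's hint."""
        owner = next((n for n in self.cluster if uid in n.data), None)
        if owner is not None:
            for n in self.cluster:
                n.last_known[uid] = owner
        return owner

    def route(self, uid, continuation, visited=()):
        """Run continuation where uid lives; return (result, broadcasts used)."""
        if uid in self.data:
            return continuation(self.data[uid]), 0
        hint = self.last_known.get(uid)
        if hint is not None and hint is not self and self not in visited:
            return hint.route(uid, continuation, visited + (self,))
        # No hint, or the continuation looped back here: fall back to a broadcast.
        owner = self.broadcast_find(uid)
        if owner is None:
            raise KeyError(uid)
        result, _ = owner.route(uid, continuation)
        return result, 1
```

In this sketch a stale hint costs extra hops, and a broadcast happens only when there is no hint or the continuation comes back to a node that already forwarded it. After one broadcast every node's hint is fresh, so later requests go straight to the owner.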

Thoughts?

Ian.

--
Ian Clarke
CEO, Uprizer Labs
Email: i...@uprizer.com
Ph: +1 512 422 3588
