Side effects


Ben Harris

Sep 29, 2009, 4:27:50 AM
to Swarm Discussion
Was wondering if you have done any thinking about side effects in the
calculations. It becomes very hard to handle concurrent data access if
you can't abort and restart operations (due to side effects). A
decision needs to be made about how to handle side effects before any
work can really commence on transactional memory.

Thoughts?

Ian Clarke

Sep 29, 2009, 1:33:53 PM
to swarm-...@googlegroups.com
Hi Ben,

It's a good point. Currently there are no restrictions on what can be done from within a Swarm thread, and this would definitely screw up (for example) optimistic locking for STM, as you could have side effects executed more than once.

I think initially we would just need to rely on the programmer not to call methods that have side-effects within transactions.  Later we could enforce this as part of a Swarm compiler plugin, which uses a whitelist of method calls in the standard library.
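A hypothetical sketch of such a whitelist check (class, method names and whitelist entries here are all invented for illustration; a real compiler plugin would inspect compiler trees rather than strings):

```java
import java.util.Set;

// Toy version of the whitelist idea: before allowing a method call
// inside a transaction, check it against a set of standard-library
// methods known to be free of side effects. All entries are invented.
public class SideEffectWhitelist {
    private static final Set<String> PURE_METHODS = Set.of(
            "java.lang.String.toUpperCase",
            "java.lang.String.substring",
            "java.lang.Math.max");

    static boolean allowedInTransaction(String qualifiedMethod) {
        return PURE_METHODS.contains(qualifiedMethod);
    }

    public static void main(String[] args) {
        System.out.println(allowedInTransaction("java.lang.Math.max"));  // true
        System.out.println(allowedInTransaction("java.io.File.delete")); // false
    }
}
```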

Do you think that is an adequate approach for now?

I'm assuming that the continuations plugin should be very useful for implementing STM - right?

Ian.
--
Ian Clarke
CEO, Uprizer Labs
Email: i...@uprizer.com
Ph: +1 512 422 3588
Fax: +1 512 276 6674

Ben Harris

Sep 29, 2009, 4:58:14 PM
to Swarm Discussion
Well, because each shared piece of data is an object, I thought it
would be simple to implement a snapshot isolation system (with
multiple versions of each piece of data hidden inside the single
object). Assuming there was a system implemented to avoid the "write
skew" issue, snapshot isolation would mesh pretty well with the whole
Swarm system.

If you keep track of the ID of the oldest operation, then garbage
collection becomes easy (if the ID of the oldest operation to own/
receive the reference to this piece of data is older than the oldest
running operation, then nobody knows about us. Or something like
that).
Replication and redundancy are slightly simplified if you can tell
whether a resource is current enough for you to use (the ID of the
last update is >= our ID; because of snapshot isolation we don't care
if it is newer somewhere else, only that it is new enough).
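A minimal in-memory sketch of that multi-version idea, assuming operation IDs increase monotonically (the class and method names are invented for illustration, not Swarm API):

```java
import java.util.Map;
import java.util.TreeMap;

// A cell that keeps one version of its value per writing operation ID.
// Readers see a snapshot: the newest version no newer than their own ID.
public class VersionedCell<T> {
    final TreeMap<Long, T> versions = new TreeMap<>();

    void write(long opId, T value) { versions.put(opId, value); }

    // Snapshot read: latest version committed at or before our ID.
    T read(long opId) {
        Map.Entry<Long, T> e = versions.floorEntry(opId);
        return e == null ? null : e.getValue();
    }

    // GC rule from above: versions older than the oldest running
    // operation can never be read again, except the newest of them
    // (it is still the value that old readers see).
    void gc(long oldestRunningOpId) {
        Long keep = versions.floorKey(oldestRunningOpId);
        if (keep != null) versions.headMap(keep, false).clear();
    }

    public static void main(String[] args) {
        VersionedCell<String> cell = new VersionedCell<>();
        cell.write(1, "a");
        cell.write(5, "b");
        System.out.println(cell.read(3)); // "a": op 3 can't see version 5
        System.out.println(cell.read(9)); // "b"
        cell.gc(7);                       // version 1 is unreachable now
        System.out.println(cell.versions.size()); // 1
    }
}
```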

Committing the operation is the next challenge. There could be a
simple two-phase commit operation, or a three-phase one, or a more
complicated system. I suppose starting with an inefficient but simple
system would be good, then scaling up to a faster, more customised
system later. There are a number of papers available on STM systems
that support inevitable transactions (transactions with side effects
need to be inevitable so they don't get aborted).
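The simple two-phase commit mentioned above can be sketched in-process like this (Participant and Node are invented stand-ins for remote nodes; a real implementation would run the two phases over the network and handle coordinator failure):

```java
import java.util.List;

// Toy two-phase commit: the coordinator asks every participant to
// prepare, and commits only if all vote yes; otherwise all abort.
public class TwoPhaseCommit {
    interface Participant {
        boolean prepare();  // phase 1: "can you commit?"
        void commit();      // phase 2, success path
        void abort();       // phase 2, failure path
    }

    static boolean run(List<Participant> participants) {
        for (Participant p : participants) {
            if (!p.prepare()) {                  // any "no" aborts everyone
                participants.forEach(Participant::abort);
                return false;
            }
        }
        participants.forEach(Participant::commit);
        return true;
    }

    static class Node implements Participant {
        final boolean canCommit;
        String state = "pending";
        Node(boolean canCommit) { this.canCommit = canCommit; }
        public boolean prepare() { return canCommit; }
        public void commit() { state = "committed"; }
        public void abort() { state = "aborted"; }
    }

    public static void main(String[] args) {
        Node a = new Node(true), b = new Node(true);
        System.out.println(run(List.of(a, b)) + " " + a.state); // true committed

        Node c = new Node(true), d = new Node(false);
        System.out.println(run(List.of(c, d)) + " " + c.state); // false aborted
    }
}
```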

I suppose the idea is to pick a system that is not necessarily the
fastest, but one that doesn't need the user to know much about the
underlying architecture.

On Sep 29, 5:33 pm, Ian Clarke <ian.cla...@gmail.com> wrote:
> Hi Ben,
> Its a good point, currently there are no restrictions on what can be done
> from within a Swarm thread, and definitely this would screw up (for example)
> optimistic locking for STM as you could have side-effects executed more than
> once.
>
> I think initially we would just need to rely on the programmer not to call
> methods that have side-effects within transactions.  Later we could enforce
> this as part of a Swarm compiler plugin, which uses a whitelist of method
> calls in the standard library.
>
> Do you think that is an adequate approach for now?
>
> I'm assuming that the continuations plugin should be very useful for
> implementing STM - right?
>
> Ian.
>

Артём

Sep 29, 2009, 6:00:27 PM
to Swarm Discussion
On 29 Sep, 21:33, Ian Clarke <ian.cla...@gmail.com> wrote:
> I think initially we would just need to rely on the programmer not to call
> methods that have side-effects within transactions.  Later we could enforce
> this as part of a Swarm compiler plugin, which uses a whitelist of method
> calls in the standard library.

Since at any point of the computation you have some locally available
data, I think it is important to be able to do anything with it.
For example, suppose we have a String (we access it, and so the
computation is moved to the node with that String): one might then
wish to run a library method on that String, for example to deflate
it. Now, deflating a string involves a lot of side-effecting code,
but it shouldn't matter, since that String is local!

What I mean is that the possible performance advantage of the Swarm
approach lies in the fact that the current piece of data is local to
the application, and thus it doesn't need the traditional
transactional and locking approaches that the usual distributed
systems have to utilize.

As a side note:
Java isn't pure, and STMs in pure Haskell and STMs in other, non-pure
languages are two very different beasts.
Most people who think about STM adoption in languages with side
effects think along the lines of using low-level memory protection
facilities, and they go this way for a reason: you can't force a
side-effecting environment to be pure without severely limiting its
usability.
If Swarm tried to implement a pure "kernel" without a way to go out
and work unsafely on locally available data, then its usability would
be very thin.

Ben Harris

Sep 30, 2009, 4:03:01 AM
to Swarm Discussion
On Sep 29, 10:00 pm, Артём <artem...@gmail.com> wrote:
> What I mean, is that the possible performance advantage of Swarm
> approach
> lies in that the current piece of data is local to the application and
> thus it doesn't
> need the traditional transactional and locking approaches the usual
> distributed system
> have to utilize.

If you aren't operating on a consistent version of the memory, your
application behaviour is undefined.

Node 1 has x = 0, Node 2 has y = 0
Operation 1 is
if(x != y) while(true)
Operation 2 is
y++; x++

If they both start at the same time then Operation 1 can get into an
infinite loop. Concurrency is not a simple thing.
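The unlucky interleaving can be forced by hand in a few lines of Java (x and y are plain local ints standing in for the two nodes' variables; a real run would need real threads and real bad luck):

```java
public class TornRead {
    // Returns what Operation 1's guard (x != y) observes when it runs
    // between the two halves of Operation 2 ("y++; x++").
    static boolean guardSeesTornState() {
        int x = 0;                // lives on Node 1
        int y = 0;                // lives on Node 2
        y++;                      // first half of Operation 2 has run...
        boolean guard = (x != y); // Operation 1 reads the torn state
        x++;                      // ...second half of Operation 2, too late
        return guard;             // true: Operation 1 would spin forever
    }

    public static void main(String[] args) {
        System.out.println(guardSeesTornState()); // prints "true"
    }
}
```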

Also, what are the side effects of deflate on a String? You put a
String in and get a String out, so what side effects are there? Scala
has a large number of immutable data structures. Also, when I say
side effects, I am thinking more about file operations, network
operations, etc.

Артём

Sep 30, 2009, 5:30:38 AM
to Swarm Discussion
On 30 Sep, 12:03, Ben Harris <bharri...@gmail.com> wrote:
> If you aren't operating on a consistent version of the memory, your
> application behaviour is undefined.

Oh, so your ambition is to combine concurrency and strictly defined
behaviour? So far nobody has achieved that kind of thing in the
practical world. Even Haskell programs have to use state and C
libraries once in a while.

> Node 1 has x = 0, Node 2 has y = 0
> Operation 1 is
> if(x != y) while(true)
> Operation 2 is
> y++; x++
>
> If they both start at the same time then Operation 1 will get in to an
> infinite loop.

Variables "x" and "y" are state.
You programmed your operation such that in a certain state it will
go into an infinite loop. It would do the same without concurrency.
Concurrency only introduces the possibility that the state will be
different.

> Concurrency is not a simple thing.
>
> Also, what are the side effects of deflate on string? You should put
> String in and get String out. What side effects are there?

You and I know that there are virtually no side effects
(apart from linking into a C library, allocating memory with malloc,
etc., but we can still mark the operation as one without side effects
and nobody would know better);
however, the language does not know: it cannot *prove* that calling
the ZIP algorithm, which is essentially not pure, does not violate
its safety constraints. So, if you tried to enforce safety at the
language level, you would essentially ban users from effectively
using existing stateful algorithms such as ZIP.

> Scala has a
> large number of immutable data structures. Also, when I say side-
> effects, I am thinking more about file operations, network operations,
> etc.

Exactly what I mean.
Here I want to call the "unrar" program to unpack my string,
or call a C library on my data,
or do something as simple as scanning application logs on multiple
servers, using memory-mapped buffers, and gathering the results using
Swarm.

Practically everything in the imperative programming world has side
effects, from the point of view of a purely functional compiler. What
I am saying is that to be practical, Swarm should allow such
computations on local data. Trying to limit it to operating on a
closed set of data structures will make it next to useless, at least
to me.

Артём

Sep 30, 2009, 5:41:27 AM
to Swarm Discussion
On 30 Sep, 12:03, Ben Harris <bharri...@gmail.com> wrote:
> If you aren't operating on a consistent version of the memory, your
> application behaviour is undefined.

As a side note:

1) You have a consistent memory model in Java.
2) With a new C++ standard emerging, you have it for C++ as well.
3) You always have it for a certain processor family even in assembly.

http://lambda-the-ultimate.org/node/3554

Verification of side-effecting programs IS possible; it's just that
it isn't always *computationally* possible.

Ben Harris

Sep 30, 2009, 9:31:06 AM
to Swarm Discussion
On Sep 30, 9:30 am, Артём <artem...@gmail.com> wrote:
> On 30 Sep, 12:03, Ben Harris <bharri...@gmail.com> wrote:
> > Node 1 has x = 0, Node 2 has y = 0
> > Operation 1 is
> > if(x != y) while(true)
> > Operation 2 is
> > y++; x++
>
> > If they both start at the same time then Operation 1 will get in to an
> > infinite loop.
>
> Variables "x" and "y" is a state.
> You programmed your operation such that in a certain state it will
> go into infinite loop. It would do the same without concurrency.
> Concurrency only introduces the possibility that state will be
> different.

It is a simple example. How about this:
x = 10
Operation 1 is
x = somelongoperation(x)
Operation 2 is
x = x + 2

If Operation 1 starts before Operation 2, and Operation 2 finishes
before Operation 1, then it is as if Operation 2 never happened. This
is unacceptable.
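Forcing that interleaving by hand makes the lost update visible (slowDouble is a hypothetical stand-in for somelongoperation; the "scheduling" is simulated sequentially):

```java
public class LostUpdate {
    // Hypothetical stand-in for "somelongoperation".
    static int slowDouble(int v) { return v * 2; }

    // Forces the unlucky interleaving: both operations snapshot x,
    // Operation 2 commits first, Operation 1 commits over the top.
    static int run() {
        int x = 10;
        int op1Snapshot = x;         // Operation 1 reads x, starts working
        int op2Snapshot = x;         // Operation 2 also reads x
        x = op2Snapshot + 2;         // Operation 2 finishes first: x = 12
        x = slowDouble(op1Snapshot); // Operation 1 finishes: x = 20
        return x;                    // Operation 2's "+ 2" has vanished
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 20, not the serialised 22 or 24
    }
}
```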

There can be a mechanism to remove a piece of data from the shared
pool (taking exclusive control of it). But that blocks all other
operations that want to use that memory, so it should only be used in
special cases.

Maybe I'm being closed-minded; I haven't studied the subject broadly.
Can you send me examples of distributed STM systems that don't have
some sort of serialisability system?

Артём

Sep 30, 2009, 9:56:02 AM
to Swarm Discussion
On Sep 30, 5:31 pm, Ben Harris <bharri...@gmail.com> wrote:
> It is a simple example. How about.
> x=10
> Operation 1
> x = somelongoperation(x)
> Operation 2
> x = x + 2
>
> If operation 1 starts before operation 2, and operation 2 finished
> before operation 1. Then it is like operation 2 never happened. This
> is unacceptable.

All right, the program "shoots itself in the foot" here.

> There can be a mechanism to remove the piece of data from the shared
> pool (Taking exclusive control of it). But that blocks all other
> operations that want to use the memory, so it should only be used in
> special cases.

That, IMO, defies the strong points of Swarm.
Here is a real-life example: we have an advertising system which
writes logs; the logs are very large, and we don't want to move them;
instead we want to move the computation to them: go to the nodes
containing the logs, scan them (using an appropriate stateful
algorithm, which is more efficient than the usual one) and go back to
the originating node with the collected data.
A naive STM system will prevent us from accessing the logs
concurrently, because it can't check whether our proprietary
algorithms are safe, and for no reason, because as a developer of the
system I know that the scanned portion of the logs is immutable.
Now, those large logs are exactly what makes the Swarm approach
attractive: move the computation to the logs, not the logs to the
computation (which is impossible; one server doesn't have that much
HDD space).
And, thanks, but we don't need double-locking, STM or whatever when
we scan the logs.

> Maybe I'm being closed minded, I haven't studied the subject broadly.
> Can you send me examples of distributed STM systems that don't have
> some sort of serialisability system?

My thinking is that Swarm is fundamentally different from STM
systems: STMs work with shared data, propagating that data between
local caches; Swarm works with local data, propagating the
computation to the data.
I *fear* that trying to implement STM at the language level (for
Swarm) would require a "safe kernel" or a DSL or a Monad, whatever
you call it, which would limit what you can do with the data to some
primitive operations on Ints and Doubles, maybe Maps. It would be a
nice toy, but useless from the point of view of efficient cloud
computing.

Артём

Sep 30, 2009, 10:10:20 AM
to Swarm Discussion
On Sep 30, 5:56 pm, Артём <artem...@gmail.com> wrote:
> And, thanks, but we don't need double-locking, STM or whatever when we
> scan the logs.

And if we need some modifiable local data to be accessed
concurrently, we can use anything from "synchronized" and
"Collections.synchronizedMap" to database transactions, if the local
resource we are accessing happens to be an embedded database.

Of course, there might be a need for some automatic lock management
from Swarm, but not in any of the use cases which came to mind.
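For that locally-shared mutable case, the plain JDK tools already suffice, e.g. a synchronized map gathering per-log counts on one node (the key name is just an example):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class LocalSync {
    public static void main(String[] args) {
        // A map on one node, shared between local worker threads only;
        // no distributed transaction machinery involved.
        Map<String, Integer> hits =
                Collections.synchronizedMap(new HashMap<>());
        hits.merge("node-A.log", 1, Integer::sum);
        hits.merge("node-A.log", 1, Integer::sum);
        System.out.println(hits.get("node-A.log")); // 2
    }
}
```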

As for your x = operation(x) example, I think there is another,
separate matter which is currently not obvious in Swarm and probably
leads to confusion: some data should travel between nodes together
with the computation! For example, when I scan the local logs or a
local database, I need to gather my results somewhere. In that case I
might do x = x + 1 or map.put(...), but those modifications have to
be thread-AND-computation-local.

Ben Harris

Sep 30, 2009, 10:25:05 AM
to Swarm Discussion
On Sep 30, 2:10 pm, Артём <artem...@gmail.com> wrote:
> On Sep 30, 5:56 pm, Артём <artem...@gmail.com> wrote:
>
> > And, thanks, but we don't need double-locking, STM or whatever when we
> > scan the logs.

No, you don't. Files aren't part of the STM; they are local to the
nodes and are fixed. But if you are operating on the files and want
to update shared data, you need to know that your operation will be
inevitable (it will never be aborted and restarted due to a conflict
when using shared data).

And operations have their own local data, but for shared data there
needs to be some system to ensure that the data is in a consistent
state from the point of view of the operation. Shared data can float
around the nodes; it isn't rooted at a particular node.

Артём

Sep 30, 2009, 10:39:55 AM
to Swarm Discussion


On Sep 30, 6:25 pm, Ben Harris <bharri...@gmail.com> wrote:
> On Sep 30, 2:10 pm, Артём <artem...@gmail.com> wrote:
>
> > On Sep 30, 5:56 pm, Артём <artem...@gmail.com> wrote:
>
> > > And, thanks, but we don't need double-locking, STM or whatever when we
> > > scan the logs.
>
> No, you don't. Files aren't part of the STM, they are local to the
> nodes and are fixed. But if you are operating on the files and want to
> update shared data, you need to know that your operation will be
> inevitable (it will never be aborted and restarted due to a conflict
> when using shared data).

Logs *are* the shared data.
As long as they are accessible via Swarm, they are shared with the
whole Swarm cluster. Only, unlike the traditional systems, there is
no need to move the data to the computation, and so there is no need
to invent anything smart and scary like double-locking transactions.

There might be an embedded database on node B, which I access and
modify from a computation which started on node A. The data in the
embedded database on node B *is* shared with node A.

You are talking about transactional semantics: roll back the data in
embedded database B if the computation fails later, whether on B or
on some other node.
That is, of course, next to impossible to achieve without taking full
control of the databases, files and everything, plus doing costly
multi-locking and such. But that, IMHO, is not the primary point of
Swarm.

> And operations have their own local data, but for shared data there
> needs to be some system to ensure that data is in a consistent state
> from the point of view of the operation. Shared data can float around
> the nodes, it isn't rooted at a particular node.

Well, the original point of Swarm, according to Ian's presentation,
was *not* to move the shared data. Apparently, you now have two kinds
of shared data in mind: the shared (or at least accessible) local
data, and Swarm-managed shared data. I don't see the point: why move
the computation if you go the traditional way of moving the data
after all? Though it's okay as long as there remains the accessible
local data which isn't and can't be moved.

Ben Harris

Sep 30, 2009, 11:03:10 AM
to Swarm Discussion
> Well, the original point of Swarm, according to the Ian's
> presentation, was *not* to move the shared data. Apparently, you now
> have two kinds of shared data in mind, the shared (or at least
> accessible) local data, and Swarm-managed shared data. I don't see the
> point: why move the computation if you go the traditional way of
> moving the data after all, but it's okay as long as there remains the
> accessible local data which isn't and can't be moved.

I disagree. There is talk in the presentation about trying to cluster
the data so as to avoid moving the operations around too much. The
idea of Swarm is to have very scalable apps: if the total amount of
shared data is too big for a single node, place it on two.

Your current example problem can already be handled with Scala
Actors. You aren't adding nodes to scale the application; you are
adding nodes because they have data you need to operate on. Unless
I'm missing something.

I'm assuming it is a lot like a current distributed system, but
instead of pulling the STM data to you, you go to it. Everything else
is the same. You can have redundancy and replication, but it is all
transparently handled by the system. Nodes can be added, removed and
expanded, and the data and operations magically shift around to make
the best use of what is available. It will get really scary if there
is some JavaRebel-like system that allows new and updated classes to
be pushed to one node, then automatically spread to the other nodes.

But, again, maybe I'm seeing what I want to see. I spent a large
amount of my spare time at the start of this year looking for the
above...

Артём

Sep 30, 2009, 11:48:09 AM
to Swarm Discussion
On 30 Sep, 19:03, Ben Harris <bharri...@gmail.com> wrote:
> Your current example problem can already be handled with Scala Actors.
> You aren't adding nodes to scale the application. You are adding nodes
> because they have data you need to operate on. Unless I'm missing
> something.

Any CPS-style program can be rewritten into a non-CPS one; that's
what the Scala CPS plugin does, after all.

> I'm assuming it is a lot like current distributed system, but instead
> of pulling the STM data to you, you go to it. But everything else is
> the same.

I've been looking into distributed databases and filesystems lately.
None that I know of have transactional semantics; it's just plain too
slow to implement across networks.

Heh, I still remember reading Oracle books which joked about how slow
its internet protocols are. Funny that it's the only database to
offer double-locking.

Артём

Sep 30, 2009, 11:58:27 AM
to Swarm Discussion
On 30 Sep, 19:48, Артём <artem...@gmail.com> wrote:
> Heh, I still remember reading Oracle books which humored about how
> its internet protocols are slow. Funny that it's the only database to
> offer
> double-locking.

Of course, my acquaintance, who is an Oracle evangelist, told me the
internet protocols had seen much improvement lately (in the 8th and
more in the 9th series, probably), but that doesn't mean
double-locking became any faster; it is inherently slow.

Артём

Sep 30, 2009, 12:45:04 PM
to Swarm Discussion
On 30 Sep, 19:48, Артём <artem...@gmail.com> wrote:
> > I'm assuming it is a lot like current distributed system, but instead
> > of pulling the STM data to you, you go to it. But everything else is
> > the same.
>
> I've been looking into distributed databases and filesystems lately.
> None that I know of have a transactional semantics,
> it's just plain too slow to implement across networks.

There seems to be a gap between clustering software and distributed
software.
Clustering software is executed on local networks, where the servers
stand next to each other and can share a lot of data via the network
"bus". Clustering software will behave much worse if one tries to use
it in a distributed fashion, e.g. having nodes at different server
providers (different datacenters in the case of Google's BigTable).

Clustering software tries to scale traditional multi-threading
approaches to multiple processors and servers. Distributed software
usually goes for a different approach, one which will work in a
distributed setting: Google's BigTable splits the data into entity
groups: it allows one to make transactional modifications to data
which resides in a single datacenter, but not across datacenters.
A lot of other distributed filesystems and databases simply drop the
Atomicity, Consistency and Isolation parts: you write the data, it is
locally saved and then distributed to the other nodes as soon as
possible, but you can't guarantee that the data will arrive from node
A at node B, because there can be a network failure or a node can be
down.
If a distributed system tried for ACID compliance, it would have to
wait for all the relevant nodes to respond, which might take a lot of
time and produce lots of false negatives in a real-life distributed
system.

Still, it is interesting how Terracotta batches transactional updates
and selectively distributes them:
http://www.theserverside.com/tt/articles/article.tss?l=TerracottaScalabilityStory
but their cluster benchmark is different from that of a distributed
system.

Ben Harris

Sep 30, 2009, 12:55:50 PM
to Swarm Discussion

Артём

Sep 30, 2009, 1:03:16 PM
to Swarm Discussion
Okay, looking up http://en.wikipedia.org/wiki/Distributed_computing,
it seems to be any computing which involves network communications,
including http://en.wikipedia.org/wiki/Cluster_%28computing%29 and
http://en.wikipedia.org/wiki/Cloud_computing

But http://en.wikipedia.org/wiki/Cloud_computing, which is
service-oriented, with BitTorrent as an example, looks closer to my
use case, I think.

Артём

Oct 1, 2009, 2:47:42 PM
to Swarm Discussion
I noticed Akka uses something called ZooKeeper
in their plan to implement distributed transactions for STM:
http://wiki.github.com/jboner/akka/modules-the-high-level-view

Артём

Oct 1, 2009, 5:05:31 PM
to Swarm Discussion
On 30 Sep, 18:03, Ben Harris <bharri...@gmail.com> wrote:
> Your current example problem can already be handled with Scala Actors.

One of the things I want is automatic discovery,
that's why I'm going to try https://shoal.dev.java.net/ soon.

Артём

Oct 9, 2009, 3:00:31 PM
to Swarm Discussion
On 30 Sep, 19:45, Артём <artem...@gmail.com> wrote:
> Google's Bigtable splits the data into entity groups:
> it allows one to make transactional modifications on a data which
> resides in a single datacenter, but not across datacenters.

They are actually working on implementing distributed transactions:
http://www.youtube.com/watch?v=IOtBlzsx0m8