Adding fault tolerance to swamp as a school project

Jeff

Mar 24, 2008, 12:46:43 AM

to SWAMP Project Users

Daniel,

I'm currently taking a (distributed) fault tolerance course as part of
my master's coursework. For the final project we were given some
options, one of which is to evaluate "some research software system"
and write up a performance report. We could also do some sort of
programming project or, lastly, a custom research idea related to a
thesis (if I had one).

I'm hoping to evaluate the fault tolerance properties of swamp. We're
considering using swamp as part of a project at work, so I see doing
something like this as benefiting both work and school--two birds with
one stone.

Now, this is just a guess, but I'm pretty sure swamp doesn't worry too
much about fault tolerance at this point. I mean no offense by my
presumption, but for starters there is no persistence of the token
returned by a script submission. If the server were to crash, the
data would be lost to the client since there is no longer a reference
to it via the token (but the data would actually be living in the
publish space.) Please correct me if I'm wrong on that last point. I
think an evaluation of swamp's fault tolerance would be a fantastic
exercise.

There's a catch, however. It's likely that I won't be able to achieve
the 15-20 pages required for the writeup based on my evaluation alone,
but who knows. That being said, I'd like to possibly add fault
tolerance to swamp via custom code. I could always develop this code
independent of swamp proper, but if it turns out to add value to the
project I'm hoping that I could contribute it back to you.

I'd like your opinion on a few things.
1) Do you think this fault tolerance evaluation is a worthwhile
endeavor?
2) Do you see this as adding value to swamp if some code were
contributed?

Thanks.

Jeff

Daniel

Mar 24, 2008, 8:18:19 PM

to SWAMP Project Users

> Now, this is just a guess, but I'm pretty sure swamp doesn't worry too
> much about fault tolerance at this point. I mean no offense by my
> presumption, but for starters there is no persistence of the token
> returned by a script submission. If the server were to crash, the
> data would be lost to the client since there is no longer a reference
> to it via the token (but the data would actually be living in the

That's absolutely correct. In a previous version, all job state
(submission, progress, etc) was tracked in a SQLite database, but the
concurrency issues of that implementation were pretty horrifying for
performance. That said, there's definitely room to store at least the
high-level job state for fault tolerance issues. Putting only the
high-level stuff in the db should be fine for performance. You could
do periodic checkpoints of the finer-grained job state, as well.
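
A minimal sketch of that idea, assuming a single SQLite table keyed by
the submission token. The class name `JobStateStore`, the `jobs` table,
and the status strings are hypothetical and are not part of swamp's
actual code. Only high-level transitions are written on every change;
finer-grained state is stored as an occasional JSON checkpoint.

```python
import json
import sqlite3
import time


class JobStateStore:
    """Hypothetical store for high-level job state, keyed by token."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            " token TEXT PRIMARY KEY,"
            " status TEXT NOT NULL,"
            " checkpoint TEXT,"
            " updated REAL NOT NULL)"
        )
        self.conn.commit()

    def record(self, token, status):
        # One small write per high-level transition (submitted, running, done).
        with self.conn:
            cur = self.conn.execute(
                "UPDATE jobs SET status = ?, updated = ? WHERE token = ?",
                (status, time.time(), token),
            )
            if cur.rowcount == 0:
                self.conn.execute(
                    "INSERT INTO jobs (token, status, updated) VALUES (?, ?, ?)",
                    (token, status, time.time()),
                )

    def checkpoint(self, token, fine_state):
        # Called periodically, not on every change, to limit write contention.
        with self.conn:
            self.conn.execute(
                "UPDATE jobs SET checkpoint = ?, updated = ? WHERE token = ?",
                (json.dumps(fine_state), time.time(), token),
            )

    def recover(self):
        # After a restart, return every job that had not finished.
        rows = self.conn.execute(
            "SELECT token, status, checkpoint FROM jobs WHERE status != 'done'"
        ).fetchall()
        return {t: (s, json.loads(c) if c else None) for t, s, c in rows}
```

After a crash, a restarted server could call `recover()` to find the
tokens that were still in flight, so a client holding a token would
still be able to look up its job.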

> publish space.) Please correct me if I'm wrong on that last point. I
> think an evaluation of swamp's fault tolerance would be a fantastic
> exercise.

Sure, feel free. Fault tolerance hasn't been a huge priority, because
it falls in the category of something desired only when the larger
goal (getting results from (remote) data faster) is established and
accomplished.

> There's a catch, however. It's likely that I won't be able to achieve
> the 15-20 pages required for the writeup based on my evaluation alone,
> but who knows. That being said, I'd like to possibly add fault
> tolerance to swamp via custom code. I could always develop this code
> independent of swamp proper, but if it turns out to add value to the
> project I'm hoping that I could contribute it back to you.

Sure, that sounds good. FYI, I'm working on a major change to the
underlying execution engine, so that part is a little unstable right
now, but the front-end API seems workable. I'll merge the engine
changes as soon as they're workable, too.

> I'd like your opinion on a few things.
> 1) Do you think this fault tolerance evaluation is a worthwhile
> endeavor?

Of swamp's fault tolerance in particular? Probably. I'd be
interested in what you find. At this point, it hasn't been a big deal
to just clean everything up and restart the jobs, but I could see how
people could find it a hassle.

> 2) Do you see this as adding value to swamp if some code were
> contributed?

Yes, sure. Fault tolerance would be a great thing in SWAMP. And the
prettier and better documented the code is, the less chance I'll have
to accidentally break it while implementing feature X, Y, or Z. (That
goes for my own code, as well.)

Take care,
-Daniel

Jeff

Mar 30, 2008, 8:10:12 PM

to SWAMP Project Users

I have a few requests or questions.

Is the "publications" part of the google code wiki up to date? I'm
looking for as much reading material as I can. Also, do you have any
links to swamp reading material that hasn't been published yet?

As far as your execution engine changes, is there some way you could
describe those changes somewhere even if they aren't done yet? I must
base this writeup on some revision of the code, so I'm pretty sure
it's going to be on the current 0.1 release. That version of the
execution engine may be part of my evaluation, so if your changes end
up fixing or altering any fault tolerance concerns of mine, I'd like
to document that as well.

Thanks!

Jeff

Daniel

Apr 4, 2008, 2:09:22 PM

to SWAMP Project Users

Hi Jeff,

Sorry it took a while to respond.

> Is the "publications" part of the google code wiki up to date? I'm
> looking for as much reading material as I can. Also, do you have any
> links to swamp reading material that hasn't been published yet?

It's reasonably up-to-date. There's another conference paper that's
pending that I'll send off to you over email. We're working on a
paper on using scripts as languages, but it's high-level and tries to
stay away from talking about implementation.

> As far as your execution engine changes, is there some way you could
> describe those changes somewhere even if they aren't done yet? I must
> base this writeup on some revision of the code, so I'm pretty sure
> it's going to be on the current 0.1 release. That version of the
> execution engine may be part of my evaluation, so if your changes end
> up fixing or altering any fault tolerance concerns of mine, I'd like
> to document that as well.

I'm getting ready to merge the branch (today?), which includes code
for partitioning things into clusters so that scheduling can be done
at a coarser granularity, reducing communication and scheduling
latencies. It's pretty rough right now, but it seems to work, so I'm
merging it. The partitioning algorithm is just a basic one that I
thought up after finding that the more generic flow-network
partitioning algorithms had different goals than mine.

As far as the API is concerned, the front-end client-server is still
the same, but the lower level uses threading for local execution and
callbacks for remote execution. This pretty much eliminates the
annoying latency issues from periodic polling (plus the question of
how often to poll). I should mention one optimization to clustered
scheduling: clusters may be dispatched as soon as all their inputs are
ready, which may be before the parent clusters complete.
This speeds things up for the (common?) case where a child cluster
depends on only a subset of its parent's multiple outputs.
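
The input-ready dispatch described above can be sketched roughly as
follows. This is an illustration under assumptions, not the code in the
branch: `Cluster`, `ReadyDispatcher`, and the file names are
hypothetical. The executor is assumed to call `on_output_ready` once per
produced file (a callback, so no polling), and a cluster is dispatched
the moment its input set is satisfied.

```python
class Cluster:
    """Hypothetical unit of scheduling: a group of commands."""

    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = set(inputs)
        self.outputs = list(outputs)


class ReadyDispatcher:
    """Dispatch each cluster as soon as all of its inputs exist."""

    def __init__(self, clusters, dispatch):
        self.pending = {c.name: c for c in clusters}
        self.ready_files = set()
        self.dispatch = dispatch  # callable that starts a cluster

    def start(self):
        # Clusters with no inputs can run immediately.
        self._dispatch_ready()

    def on_output_ready(self, filename):
        # Callback invoked by the executor for each finished output file.
        self.ready_files.add(filename)
        self._dispatch_ready()

    def _dispatch_ready(self):
        runnable = [c for c in self.pending.values()
                    if c.inputs <= self.ready_files]
        for c in runnable:
            # pop() guards against re-entrant callbacks dispatching twice.
            if self.pending.pop(c.name, None) is not None:
                self.dispatch(c)


started = []
d = ReadyDispatcher(
    [
        Cluster("parent", [], ["a.nc", "b.nc"]),
        Cluster("child1", ["a.nc"], ["c.nc"]),
        Cluster("child2", ["a.nc", "b.nc"], ["d.nc"]),
    ],
    dispatch=lambda c: started.append(c.name),
)
d.start()
d.on_output_ready("a.nc")  # child1 can start before parent has written b.nc
d.on_output_ready("b.nc")
```

Here `child1` needs only one of the parent's two outputs, so it is
dispatched when `a.nc` appears rather than when the whole parent cluster
completes.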

I'll try to respond faster in the future. Feel free to ping me if I
take too long to respond.
-Daniel
