Call for masters thesis ideas (possibly related to Clojure)

58 views
Skip to first unread message

Patrick Kristiansen

unread,
Dec 18, 2009, 7:35:55 AM12/18/09
to Clojure
Hi

We're two students that have been working with concurrent programming
languages (Erlang and Clojure), and while both languages are very
interesting, we would like to work on something related to Clojure in
our masters thesis.

I once asked on #clojure for ideas, and Rich Hickey suggested looking
into predicate dispatch and it is one idea that we are considering. We
have also considered working on distributed Clojure, but I don't know
if there is already an effort going on in that regard?

Do you have any other suggestions? We'd be really thankful for any
suggestions.

Thanks in advance.

-Patrick

ajay gopalakrishnan

unread,
Dec 18, 2009, 10:03:57 AM12/18/09
to clo...@googlegroups.com
Hi,

An idea I was interested in (more from an learning opportunity perspective than thesis) is provided more Persistent data structures. As of now, we have the the Array mapped Hash tree based structure that works well for Vectors, Maps and Graphs too. What I feel is missing is Tree based structures. Although, you can simulate them using Vectors as children indices, I think if we directly implement the Persistent Data structure of Binomial/Fibonacci Heaps and BSTs with all their operations (insert,delete,search,meld), it would be much faster. I could be wrong though!

Anyways ... my 2 cents.

Thanks,
Ajay G.

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Sean Devlin

unread,
Dec 18, 2009, 10:23:04 AM12/18/09
to Clojure
Most of the distributed Clojure work is based on running Clojure on a
distributed JVM package.

Here's a link on Clojure + Terracotta. I have heard mention of other
solutions as well, I just don't recall them.

http://groups.google.com/group/clojure/browse_thread/thread/9a9690947617906/6bc50fb45ca56e7d

Sean

On Dec 18, 7:35 am, Patrick Kristiansen

Tassilo Horn

unread,
Dec 18, 2009, 1:35:33 PM12/18/09
to clo...@googlegroups.com
ajay gopalakrishnan <ajgo...@gmail.com> writes:

Hi Ajay,

> An idea I was interested in (more from an learning opportunity
> perspective than thesis) is provided more Persistent data structures.
> As of now, we have the the Array mapped Hash tree based structure that
> works well for Vectors, Maps and Graphs too.

In the last few month I didn't find any time to play around with clojure
or follow discussions on this list, so could you give me a pointer to
persistent graph structures? That would be an area where I would be
utmost interested.

Bye,
Tassilo

Niels Mayer

unread,
Dec 18, 2009, 4:18:27 PM12/18/09
to patrick.k...@gmail.com, clo...@googlegroups.com
(0)  Create a toolkit to run multiple parallel, tightly communicating clojure apps on google-app engine, simulating a single, long-running, multithreaded JVM instance that does not appear, to the user, to be limited by the constraints of GAE's java implementation (e.g. single threading, shared refs work despite being distributed); at the same time, the toolkit would minimize resources consumed in GAE, persisting threads/continuations waiting for data, and heuristically determining lowest cost for long-term storage versus memory and runtime consumption.

THE BIG CAVEAT

> Two unusual aspects of the Google AppEngine environment create pretty major
> constraints on your ability to write idiomatic Clojure.

> First, an AppEngine application runs in a security context that doesn't
> permit spawning threads, so you won't be able to use Agents, the
> clojure.parallel library, or Futures.

> Second, one of the most exciting features of AppEngine is that your
> application will be deployed on Google's huge infrastructure, dynamically
> changing its footprint depending on demand. That means you'll potentially be
> running on many JVMs at once. Unfortunately this is a strange fit for
> Clojure's concurrency features, which are most useful when you have precise
> control over what lives on what JVM (and simplest when everything runs on
> one JVM). Since shared references (Vars, Refs, and Atoms) are shared only
> within a single JVM, they are not suitable for many of their typical uses
> when running on AppEngine. You should still use Clojure's atomic references
> (and their associated means of modification) for any state that it makes
> sense to keep global per-JVM, since there may be multiple threads serving
> requests in one JVM. But remember JVMs will come and go during the lifetime
> of your application, so anything truly global should go in the Datastore or
> Memcache.

(1)  a clojure implementation of Yahoo's PNUTs, using STM's and all the cool facilities clojure provides: http://research.yahoo.com/files/pnuts.pdf (interesting to have a writeup of a real-world impl alongside comparisons to Google Bigtable and Amazon Dynamo)

We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications.

The foremost requirements of web applications are scalability, consistently good response time for geographically dispersed users, and high availability. At the same time, web applications can frequently tolerate relaxed consistency guarantees.

For example, if a user changes an avatar ... little harm is done if the new avatar is not initially visible to one friend .... It is often acceptable to read (slightly) stale data, but occasionally stronger guarantees are required by applications.

PNUTS provides a consistency model that is between the two extremes of general serializability and eventual consistency ... We provide per-record timeline consistency: all replicas of a given record apply all updates to the record in the same order .... The application [can] indicate cases where it can do with some relaxed consistency for higher performance .... [such as reading] a possibly stale version of the record.
 
Some interesting commentary from http://glinden.blogspot.com/2009/02/details-on-yahoos-distributed-database.html

..................

When reading the paper, a couple things about PNUTS struck me as surprising:

First, the system is layered on top of the guarantees of a reliable pub-sub message broker which acts "both as our replacement for a redo log and our replication mechanism." I have to wonder if the choice to not build these pieces of the database themselves could lead to missed opportunities for improving performance and efficiency.

Second, as figures 3 and 4 show, the average latency of requests to their database seems quite high, roughly 100 ms. This is high enough that web applications probably would incur too much total latency if they made a few requests serially (e.g. ask for some data, then, depending on what the data looks like, ask for some other data). That seems like a problem.

Please see also my August 2006 post, "Google Bigtable paper", which discusses the distributed database behind many products at Google.

Please see also my earlier post, "Highly available distributed hash store at Amazon", on the distributed database behind some features at Amazon.com.

Please see also my earlier posts, "Cassandra data store at Facebook" and "HBase: A Google Bigtable clone".

Update: One of the developers of PNUTS commented on this post, pointing out that PNUTS performance is much better in practice (1-10ms/request) when caching layers are in place and making a few comparisons to Bigtable.

..................


-- Niels
http://nielsmayer.com

Joost

unread,
Dec 18, 2009, 5:06:53 PM12/18/09
to Clojure
On 18 dec, 13:35, Patrick Kristiansen <patrick.kristian...@gmail.com>
wrote:

Erm, what's your master? I'll assume CS.

Personally, I'm interested in whether complete thread abstraction that
makes "threads" as light-weight as possible, but also the only way to
do concurrency (like Erlang provides with its "processes") is really
the best way to model concurrent programs. I'm over 95% sure that
native threads really are not the best way to model in-process
concurrency for most programs, simply because of all the overhead that
you incur especially when dealing with massively concurrent long-
running thead-like-things - since you both know erlang, you will know
what I'm talking about - that sort of approach really doesn't work in
clojure, that's why clojure uses thread pools for agents etc.

But maybe a new and lightweight in-kernel thread model is an
interesing subject? Just asking :)

Good luck with your thesis,
Joost Diepenmaat.

Patrick Kristiansen

unread,
Dec 18, 2009, 7:20:21 PM12/18/09
to Clojure
On Dec 18, 11:06 pm, Joost <jo...@zeekat.nl> wrote:
> Erm, what's your master? I'll assume CS.

Well it's software engineering, but close enough :). I should have
mentioned that.

> Personally, I'm interested in whether complete thread abstraction that
> makes "threads" as light-weight as possible, but also the only way to
> do concurrency (like Erlang provides with its "processes") is really
> the best way to model concurrent programs. I'm over 95% sure that
> native threads really are not the best way to model in-process
> concurrency for most programs, simply because of all the overhead that
> you incur especially when dealing with massively concurrent long-
> running thead-like-things - since you both know erlang, you will know
> what I'm talking about - that sort of approach really doesn't work in
> clojure, that's why clojure uses thread pools for agents etc.
>
> But maybe a new and lightweight in-kernel thread model is an
> interesing subject? Just asking :)

Thanks for your suggestion. It's an interesting idea, and we've been
considering something along those lines. I don't know if Erlang's
processes are even more lightweight than "green threads". According to
Wikipedia [1], Linux actually performs really well in terms of context
switching OS threads compared to green threads. But I can't find the
paper Wikipedia cites, and it seems not to be downloadable from ACM
[2].

In any case, we will take this suggestion into consideration.

[1] http://en.wikipedia.org/wiki/Green_threads#Performance
[2] http://portal.acm.org/citation.cfm?id=603551

Joost

unread,
Dec 18, 2009, 8:07:42 PM12/18/09
to Clojure
On 19 dec, 01:20, Patrick Kristiansen <patrick.kristian...@gmail.com>
wrote:

> Thanks for your suggestion. It's an interesting idea, and we've been
> considering something along those lines. I don't know if Erlang's
> processes are even more lightweight than "green threads". According to
> Wikipedia [1], Linux actually performs really well in terms of context
> switching OS threads compared to green threads. But I can't find the
> paper Wikipedia cites, and it seems not to be downloadable from ACM
> [2].

The problem seems to be two part:

1. Starting new "native thread" takes a relatively long time

2. Native threads means a fairly significant memory and scheduler
consumption. In Java-based threads, that means for example that a
20,000 connection server is pretty much the maximum that's possible in
that model. In Erland, 20,000 concurrent connections is what you
should expect from an unoptimized system. This is at least partly due
to kernel design, but also due to language design. A purely side-
effect free language is much more capable of sharing data. In *that*
regard, it's probably easier to optimize Erlang than Clojure.

But in any case, massive networking is the key - In clojure, there is
plenty of room to improve in that regard. In Erlang probably too, but
erlang has no really interesing memory sharing mechanisms.

ajay gopalakrishnan

unread,
Dec 18, 2009, 7:52:35 PM12/18/09
to clo...@googlegroups.com
Put

Comparative performance evaluation of Java threads for embedded applications: Linux thread vs. Green thread

Google, and click the second link that says [PDF]. It is for free.

Niels Mayer

unread,
Dec 18, 2009, 9:43:50 PM12/18/09
to clo...@googlegroups.com
Re:

Update: One of the developers of PNUTS commented on this post, pointing out that PNUTS performance is much better in practice (1-10ms/request) when caching layers are in place and making a few comparisons to Bigtable.

Note this very interesting conversation  (below) which IMHO has potential for STM implementation in clojure (excerpt: "PNUTs supports row-level transactions too-- they are done through a get followed by test-and-set. This is nothing new: it is called optimistic concurrency control and has been around in database literature for ages, and is also used by BigTable.").

...........................

Can and would clojure help and simplify the construction of a framework for high-volume, high-availability, distributed web apps? Has something like PNUTs been implemented for Clojure and Java Clouds?
............................

Brian Cooper said...

Hi folks,

I'm another member of the PNUTS group at Yahoo! This has been a very interesting discussion; I'm glad you folks are as interested in this stuff as we are.

Just to reiterate a few points that Utkarsh brought up:

- There's no free lunch for performance, and if you want consistency some writes will have to go cross-datacenter, increasing the average latency. Cross-datacenter communication is required because of our mastership protocol. Consider a user who's record is mastered in California. If she flies to Europe, or a network problem causes her request to be redirected to a datacenter on the East coast, then her write will originate in a non-master datacenter and be forwarded back to the master for that record. In practice this happens 10-20% of the time, just because of the way web requests happen.

Even if you had Paxos, and managed to put the "local" participants in the same datacenter, occasionally a write would originate in the non-"local" datacenter and pay the cross-datacenter latency to find enough members of the quorum. So this cost is really unavoidable.

- You could weaken the consistency, to something like "eventual consistency" (write anywhere and resolve conflicts later) or even "best effort" (write anywhere and don't worry about conflicts) and avoid ever paying the cross-datacenter cost. And in fact it is possible to turn off mastership in PNUTS. But then you need a resolution protocol, and until conflicts are resolved inconsistent copies of the data are visible to readers. So again there is no free lunch.

- Anonymous is write that this system is not as optimized for scans a la MapReduce. In fact, you make different architectural decisions if you are optimizing for low-latency updates to a geographic database than if you are optimizing for scanning for MapReduce within a single datacenter. We have been able to run Hadoop jobs over PNUTS and get performance that is pretty good, just not as good as a native store (HDFS) optimized for MapReduce. So if you want to transactionally update data with very low latency and occasionally run MapReduce, you can use PNUTS; if you want to always run MapReduce but don't need a lot of high performance updates, use HDFS.

- As Utkarsh has said, the hardware used for the paper is not as good as production hardware (e.g. no battery-backed write cache, and other limitations). We hope to publish some numbers from live workloads on the production system soon.

..................
 
 

jng27

unread,
Dec 19, 2009, 3:20:05 AM12/19/09
to Clojure
What about adding circular dependency resolution to the compiler
between Clojure -> Java code ?

I know, I know - it's not very sexy but it would go a long way for
companies with larger legacy Java code bases that may be considering
moving to Clojure. Being able to replace Java code 'one step at a
time' would be invaluable in that area and it also makes Clojure
easier to 'sell'.

On Dec 18, 4:35 am, Patrick Kristiansen

Micklat

unread,
Dec 19, 2009, 5:16:52 AM12/19/09
to Clojure
How about static analysis? Plenty of interesting problems there. For
example:

1. detect incorrect uses of transients (those that would certainly
lead to exceptions) and report them during compile-time.

2. related to (1), but different: detect places where the compiler can
implicitly convert collections to transients and back, without losing
correctness, and use that as a speed-up.

3. soft typing, either to speed up code or to detect incorrect code
early. Such things have been done with Scheme.

I believe Clojure has certain advantages when working on static
analysis:
* you can call the reader and macro expander during analysis, which
greatly simplifies the language to be analyzed.
* the core language is relatively simple
* the compiler source is freely available and is written in a
relatively high-level language (java and not c), so it (should) be
relatively easy to get into.

Relatedly, it might be interesting to create formal semantics for
Clojure, and interface it with a proof assistant, enabling proofs to
be written and checked about the behavior of clojure programs.

In that context, one of the opportunities provided by lispiness is
that macros can help write the proof, by automatically creating
propositions about the code that they return. For example, (dotimes [i
n] body) can be modified to attach these propositions to its output,
using an appropriate proof language:
- i>=0
- i<(the result of) n
- the loop terminates if its body terminates

That's just scratching the surface. That field is deep and wide, and a
complete implementation is probably beyond the scope of a master's
theses. Still, parts of it can be done.

Patrick Kristiansen

unread,
Dec 19, 2009, 7:58:27 AM12/19/09
to Clojure
On Dec 19, 1:52 am, ajay gopalakrishnan <ajgop...@gmail.com> wrote:
> Put
>
> *Comparative performance evaluation of Java threads for embedded
> applications**: Linux thread vs. Green thread

Your Google search skills are obviously beyond ours. :) I've found it
now.

Jules

unread,
Dec 19, 2009, 11:04:12 AM12/19/09
to Clojure
Predicate dispatch would be an interesting topic. The #1 problem with
predicate dispatch is method ordering. Predicate dispatch as described
in the orignal paper [http://www.cs.washington.edu/homes/mernst/pubs/
dispatching-ecoop98-abstract.html] decides what order to used based on
implication between predicates. For example:

def foo(x) where x is Animal = A
def foo(x) where x is Chicken = B
def foo(x) where x is Bird = C

If you call foo(new Crow) you want the method with "where x is Bird"
to be called. A Crow is not a Chicken, so we can rule out the Chicken
method. But a Crow is also an Animal, so how does the dispatch decide
that it calls the Bird method and not the Animal method? The answer of
the authors of the predicate dispatch paper is that you determine
implication between predicates. If x is a Bird, that implies that x is
an Animal. So we test if x is a Bird and only after that we consider
the Animal method.

This is good but unfortunately checking implication between predicates
is undecidable. For example if we define this:

def bar(n) where 5^n == 5 mod n = A
def bar(n) where prime?(n) = B

How is the static analysis going to determine that the second
predicate implies the first? You can solve this case by requiring the
programmer to provide implication information to the compiler, like
this:

implication prime?(n) => 5^n == 5 mod n

As you can see this isn't nice. In my opinion we need a better way to
do method ordering. I've been thinking about this and my proposal is
to use the definition order in the source code to decide which methods
to check first: try methods that are defined last first.

def baz(x) where p(x) = A
def baz(x) where q(x) = B
def baz(x) where r(x) = C

We first check for r(x) and if it's true we execute C. If it's false
then we check if q(x) is true and execute B. If q(x) is false we check
p(x) and execute A.

The implementation of this kind of predicate dispatch is trivial. You
don't need a complicated implication checker. It's just a simple
macro. But the question is whether this simple method ordering rule
works in practice. With this order you can do everything that you can
do with a class hierarchy (in Java for example). It also subsumes
Clojure's hierarchies by using a predicate this is itself a method. I
think it's interesting to explore this further to see if this order
still works well for more advanced uses of predicate dispatch, and to
explore alternative orderings.

Some other research ideas:

- Functional reactive programming in Clojure. A problem with FRP is
that you have to invert dependencies. Instead of specifying for each
cause what its effects are, you have to specify for each effects what
its possible causes are. Can you solve this? The problem is much
easier if you don't try to prevent glitches as other FRP systems do.
Is a system without glitch prevention useful, is introducing glitches
for removing cause and effect inversion a good trade off?
- Type/value inference with abstract interpretation. Can you write an
abstact interpretation to infer possible values of a variable? Is this
useful in creating a better compiler. Can the information be used to
create better editors (intellisense?).
- In a statically typed language like Haskell & Scala you can write a
generic sum function that sums a list of things. You cannot do this in
any dynamic language that I know of. You want sum to work on any list
of things that support +. Suppose that numbers support + and strings
support + (maybe they shouldn't but bear with me for this example). sum
([1,2]) = 3. sum(["ab","cd"]) = "abcd". Now comes the problem. What is
sum([])? For numbers it should be 0, for strings it should be "". In a
statically typed language you can use the type of the list to
determine which value to return. How can we write a generic sum
function in a dynamic language?

Jules

On Dec 18, 1:35 pm, Patrick Kristiansen

Patrick Kristiansen

unread,
Dec 19, 2009, 9:22:32 AM12/19/09
to Clojure
I want to thank you all for your suggestions, the clojure community is
really great!

On Dec 18, 1:35 pm, Patrick Kristiansen
<patrick.kristian...@gmail.com> wrote:

kyle smith

unread,
Dec 21, 2009, 9:07:10 AM12/21/09
to Clojure
I also vote for static analysis. People have combined static analysis
with data mining to predict software defects.
http://scholar.google.com/scholar?hl=en&q=data+mining+software+defects&btnG=Search&as_sdt=2000&as_ylo=&as_vis=0
However, it doesn't look like anyone's tried it with any functional
languages.
Reply all
Reply to author
Forward
0 new messages