We're two students that have been working with concurrent programming
languages (Erlang and Clojure), and while both languages are very
interesting, we would like to work on something related to Clojure in
our masters thesis.
I once asked on #clojure for ideas, and Rich Hickey suggested looking
into predicate dispatch and it is one idea that we are considering. We
have also considered working on distributed Clojure, but I don't know
if there is already an effort going on in that regard?
Do you have any other suggestions? We'd be really thankful for any
suggestions.
Thanks in advance.
-Patrick
Here's a link on Clojure + Terracotta. I have heard mention of other
solutions as well, I just don't recall them.
http://groups.google.com/group/clojure/browse_thread/thread/9a9690947617906/6bc50fb45ca56e7d
Sean
On Dec 18, 7:35 am, Patrick Kristiansen
Hi Ajay,
> An idea I was interested in (more from an learning opportunity
> perspective than thesis) is provided more Persistent data structures.
> As of now, we have the the Array mapped Hash tree based structure that
> works well for Vectors, Maps and Graphs too.
In the last few month I didn't find any time to play around with clojure
or follow discussions on this list, so could you give me a pointer to
persistent graph structures? That would be an area where I would be
utmost interested.
Bye,
Tassilo
THE BIG CAVEAT> Two unusual aspects of the Google AppEngine environment create pretty major> constraints on your ability to write idiomatic Clojure.> First, an AppEngine application runs in a security context that doesn't> permit spawning threads, so you won't be able to use Agents, the> clojure.parallel library, or Futures.> Second, one of the most exciting features of AppEngine is that your> application will be deployed on Google's huge infrastructure, dynamically> changing its footprint depending on demand. That means you'll potentially be> running on many JVMs at once. Unfortunately this is a strange fit for> Clojure's concurrency features, which are most useful when you have precise> control over what lives on what JVM (and simplest when everything runs on> one JVM). Since shared references (Vars, Refs, and Atoms) are shared only> within a single JVM, they are not suitable for many of their typical uses> when running on AppEngine. You should still use Clojure's atomic references> (and their associated means of modification) for any state that it makes> sense to keep global per-JVM, since there may be multiple threads serving> requests in one JVM. But remember JVMs will come and go during the lifetime> of your application, so anything truly global should go in the Datastore or> Memcache.
We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications.
The foremost requirements of web applications are scalability, consistently good response time for geographically dispersed users, and high availability. At the same time, web applications can frequently tolerate relaxed consistency guarantees.
For example, if a user changes an avatar ... little harm is done if the new avatar is not initially visible to one friend .... It is often acceptable to read (slightly) stale data, but occasionally stronger guarantees are required by applications.
PNUTS provides a consistency model that is between the two extremes of general serializability and eventual consistency ... We provide per-record timeline consistency: all replicas of a given record apply all updates to the record in the same order .... The application [can] indicate cases where it can do with some relaxed consistency for higher performance .... [such as reading] a possibly stale version of the record.
Erm, what's your master? I'll assume CS.
Personally, I'm interested in whether complete thread abstraction that
makes "threads" as light-weight as possible, but also the only way to
do concurrency (like Erlang provides with its "processes") is really
the best way to model concurrent programs. I'm over 95% sure that
native threads really are not the best way to model in-process
concurrency for most programs, simply because of all the overhead that
you incur especially when dealing with massively concurrent long-
running thead-like-things - since you both know erlang, you will know
what I'm talking about - that sort of approach really doesn't work in
clojure, that's why clojure uses thread pools for agents etc.
But maybe a new and lightweight in-kernel thread model is an
interesing subject? Just asking :)
Good luck with your thesis,
Joost Diepenmaat.
Well it's software engineering, but close enough :). I should have
mentioned that.
> Personally, I'm interested in whether complete thread abstraction that
> makes "threads" as light-weight as possible, but also the only way to
> do concurrency (like Erlang provides with its "processes") is really
> the best way to model concurrent programs. I'm over 95% sure that
> native threads really are not the best way to model in-process
> concurrency for most programs, simply because of all the overhead that
> you incur especially when dealing with massively concurrent long-
> running thead-like-things - since you both know erlang, you will know
> what I'm talking about - that sort of approach really doesn't work in
> clojure, that's why clojure uses thread pools for agents etc.
>
> But maybe a new and lightweight in-kernel thread model is an
> interesing subject? Just asking :)
Thanks for your suggestion. It's an interesting idea, and we've been
considering something along those lines. I don't know if Erlang's
processes are even more lightweight than "green threads". According to
Wikipedia [1], Linux actually performs really well in terms of context
switching OS threads compared to green threads. But I can't find the
paper Wikipedia cites, and it seems not to be downloadable from ACM
[2].
In any case, we will take this suggestion into consideration.
[1] http://en.wikipedia.org/wiki/Green_threads#Performance
[2] http://portal.acm.org/citation.cfm?id=603551
The problem seems to be two part:
1. Starting new "native thread" takes a relatively long time
2. Native threads means a fairly significant memory and scheduler
consumption. In Java-based threads, that means for example that a
20,000 connection server is pretty much the maximum that's possible in
that model. In Erland, 20,000 concurrent connections is what you
should expect from an unoptimized system. This is at least partly due
to kernel design, but also due to language design. A purely side-
effect free language is much more capable of sharing data. In *that*
regard, it's probably easier to optimize Erlang than Clojure.
But in any case, massive networking is the key - In clojure, there is
plenty of room to improve in that regard. In Erlang probably too, but
erlang has no really interesing memory sharing mechanisms.
Re:
Update: One of the developers of PNUTS commented on this post, pointing out that PNUTS performance is much better in practice (1-10ms/request) when caching layers are in place and making a few comparisons to Bigtable.
Hi folks,
I'm another member of the PNUTS group at Yahoo! This has been a very interesting discussion; I'm glad you folks are as interested in this stuff as we are.
Just to reiterate a few points that Utkarsh brought up:
- There's no free lunch for performance, and if you want consistency some writes will have to go cross-datacenter, increasing the average latency. Cross-datacenter communication is required because of our mastership protocol. Consider a user who's record is mastered in California. If she flies to Europe, or a network problem causes her request to be redirected to a datacenter on the East coast, then her write will originate in a non-master datacenter and be forwarded back to the master for that record. In practice this happens 10-20% of the time, just because of the way web requests happen.
Even if you had Paxos, and managed to put the "local" participants in the same datacenter, occasionally a write would originate in the non-"local" datacenter and pay the cross-datacenter latency to find enough members of the quorum. So this cost is really unavoidable.
- You could weaken the consistency, to something like "eventual consistency" (write anywhere and resolve conflicts later) or even "best effort" (write anywhere and don't worry about conflicts) and avoid ever paying the cross-datacenter cost. And in fact it is possible to turn off mastership in PNUTS. But then you need a resolution protocol, and until conflicts are resolved inconsistent copies of the data are visible to readers. So again there is no free lunch.
- Anonymous is write that this system is not as optimized for scans a la MapReduce. In fact, you make different architectural decisions if you are optimizing for low-latency updates to a geographic database than if you are optimizing for scanning for MapReduce within a single datacenter. We have been able to run Hadoop jobs over PNUTS and get performance that is pretty good, just not as good as a native store (HDFS) optimized for MapReduce. So if you want to transactionally update data with very low latency and occasionally run MapReduce, you can use PNUTS; if you want to always run MapReduce but don't need a lot of high performance updates, use HDFS.
- As Utkarsh has said, the hardware used for the paper is not as good as production hardware (e.g. no battery-backed write cache, and other limitations). We hope to publish some numbers from live workloads on the production system soon.
I know, I know - it's not very sexy but it would go a long way for
companies with larger legacy Java code bases that may be considering
moving to Clojure. Being able to replace Java code 'one step at a
time' would be invaluable in that area and it also makes Clojure
easier to 'sell'.
On Dec 18, 4:35 am, Patrick Kristiansen
1. detect incorrect uses of transients (those that would certainly
lead to exceptions) and report them during compile-time.
2. related to (1), but different: detect places where the compiler can
implicitly convert collections to transients and back, without losing
correctness, and use that as a speed-up.
3. soft typing, either to speed up code or to detect incorrect code
early. Such things have been done with Scheme.
I believe Clojure has certain advantages when working on static
analysis:
* you can call the reader and macro expander during analysis, which
greatly simplifies the language to be analyzed.
* the core language is relatively simple
* the compiler source is freely available and is written in a
relatively high-level language (java and not c), so it (should) be
relatively easy to get into.
Relatedly, it might be interesting to create formal semantics for
Clojure, and interface it with a proof assistant, enabling proofs to
be written and checked about the behavior of clojure programs.
In that context, one of the opportunities provided by lispiness is
that macros can help write the proof, by automatically creating
propositions about the code that they return. For example, (dotimes [i
n] body) can be modified to attach these propositions to its output,
using an appropriate proof language:
- i>=0
- i<(the result of) n
- the loop terminates if its body terminates
That's just scratching the surface. That field is deep and wide, and a
complete implementation is probably beyond the scope of a master's
theses. Still, parts of it can be done.
Your Google search skills are obviously beyond ours. :) I've found it
now.
def foo(x) where x is Animal = A
def foo(x) where x is Chicken = B
def foo(x) where x is Bird = C
If you call foo(new Crow) you want the method with "where x is Bird"
to be called. A Crow is not a Chicken, so we can rule out the Chicken
method. But a Crow is also an Animal, so how does the dispatch decide
that it calls the Bird method and not the Animal method? The answer of
the authors of the predicate dispatch paper is that you determine
implication between predicates. If x is a Bird, that implies that x is
an Animal. So we test if x is a Bird and only after that we consider
the Animal method.
This is good but unfortunately checking implication between predicates
is undecidable. For example if we define this:
def bar(n) where 5^n == 5 mod n = A
def bar(n) where prime?(n) = B
How is the static analysis going to determine that the second
predicate implies the first? You can solve this case by requiring the
programmer to provide implication information to the compiler, like
this:
implication prime?(n) => 5^n == 5 mod n
As you can see this isn't nice. In my opinion we need a better way to
do method ordering. I've been thinking about this and my proposal is
to use the definition order in the source code to decide which methods
to check first: try methods that are defined last first.
def baz(x) where p(x) = A
def baz(x) where q(x) = B
def baz(x) where r(x) = C
We first check for r(x) and if it's true we execute C. If it's false
then we check if q(x) is true and execute B. If q(x) is false we check
p(x) and execute A.
The implementation of this kind of predicate dispatch is trivial. You
don't need a complicated implication checker. It's just a simple
macro. But the question is whether this simple method ordering rule
works in practice. With this order you can do everything that you can
do with a class hierarchy (in Java for example). It also subsumes
Clojure's hierarchies by using a predicate this is itself a method. I
think it's interesting to explore this further to see if this order
still works well for more advanced uses of predicate dispatch, and to
explore alternative orderings.
Some other research ideas:
- Functional reactive programming in Clojure. A problem with FRP is
that you have to invert dependencies. Instead of specifying for each
cause what its effects are, you have to specify for each effects what
its possible causes are. Can you solve this? The problem is much
easier if you don't try to prevent glitches as other FRP systems do.
Is a system without glitch prevention useful, is introducing glitches
for removing cause and effect inversion a good trade off?
- Type/value inference with abstract interpretation. Can you write an
abstact interpretation to infer possible values of a variable? Is this
useful in creating a better compiler. Can the information be used to
create better editors (intellisense?).
- In a statically typed language like Haskell & Scala you can write a
generic sum function that sums a list of things. You cannot do this in
any dynamic language that I know of. You want sum to work on any list
of things that support +. Suppose that numbers support + and strings
support + (maybe they shouldn't but bear with me for this example). sum
([1,2]) = 3. sum(["ab","cd"]) = "abcd". Now comes the problem. What is
sum([])? For numbers it should be 0, for strings it should be "". In a
statically typed language you can use the type of the list to
determine which value to return. How can we write a generic sum
function in a dynamic language?
Jules
On Dec 18, 1:35 pm, Patrick Kristiansen
On Dec 18, 1:35 pm, Patrick Kristiansen
<patrick.kristian...@gmail.com> wrote: