Re: Extending Clojure's STM with external transactions


Alyssa Kwan

Aug 29, 2010, 10:37:00 PM
to Clojure
I'm resurrecting this thread from quite a while ago... I'm very
interested in being able to ensure that the state of a ref is
persisted as part of the same transaction that updates the ref.
Performance is not important; correctness is. Has any further work
been done on this front?

http://groups.google.com/group/clojure/browse_thread/thread/aa22a709501a64ac

nchubrich

Aug 30, 2010, 5:02:06 PM
to Clojure
I'm not aware of any, but +1 for seeing persistence handled as part of
the language. A big project and a long-term one, to be sure, but
could it not be considered a goal?

In my student days, I was talking to a well-known Lisper (name
suppressed for fear of Google indexing) about some data structures in
MIT Scheme. When I asked about saving them to disk, he said in
effect, "You're on your own -- that's something that should be
handled, but never is".

I think people are so used to this state of affairs they forget how
ugly it really is. Programming languages are like Moses without
Joshua: they lead your data in the wilderness, but when it comes to
finding it a permanent home, you have to talk to someone else. And
these "someone elses" (who seem to be as numberless as the sons of
Abraham) each have their own habits and ways of talking.

Persistence libraries always end up warping the entire codebase; I've
never succeeded in keeping them at bay. Using data with Incanter is
different from ClojureQL, which is different from just using
contrib.sql, and all of it is different from dealing with just
Clojure. (I've never even tried Clojure + Hibernate.) You might as
well rewrite the program from scratch depending on what you use.
Maybe other people have had better luck; but whatever luck they have,
I'm sure it is a fight to keep programs abstracted from persistence.

I'd like to be able to work with mere Clojure until my program is
complete, and then work in a completely separate manner on how to read
and write data. Or maybe there would be off-the-shelf solutions I
could plug in for different needs: low latency, high read, high write,
large stores, etc.

On the Clojure side, you would simply call something like "persist
namespace", which would save the state of your current or given
namespace (unless you pass it the names of variables as well, in which
case it only saves those). And to read data, you would simply require
or use it into your namespace: you could choose what granularity to
give first-class status: just tables, or columns as well, etc. And
you could do this equally well for XML, JSON, relational data, or a
graph store; your choice. And the only difference between these and
ordinary variables would be -- heaven forbid! -- that a disk access
might happen when you deal with them!

To have such a system work well, you would need to enrich the way you
query Clojure datastructures. I have some ideas on that, but I'd like
to make sure I'm not shouting in the dark first.

I'd like to see a day when programmers need to worry about persistence
about as much as they worry about garbage collection now.

.Bill Smith

Aug 31, 2010, 11:08:55 AM
to Clojure
> I'd like to see a day when programmers need to worry about persistence
> about as much as they worry about garbage collection now.

Me too, but of course beware of leaky abstractions.

Alyssa Kwan

Sep 1, 2010, 6:14:45 PM
to Clojure
I'll go one step further and say that we shouldn't have to call
"persist namespace". It should be automatic such that a change to the
state of an identity is transactionally written.

Let's start with refs. We can tackle the other identities later.

The API is simple. Call (refp) instead of (ref) when creating a
persisted ref. Passed into the call are a persistence address (file
path, DB connection string, etc.) and a name that has to be unique to
that persistence address. Not all refs end up being referred to by a
top-level symbol in a package, and multi-process systems are hard...
Ensuring uniqueness of name is up to the programmer. Upon creation,
Clojure checks to see if the refp exists in the store; if so it
instantiates in memory with that state, else it uses the default in
the call.
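
A minimal sketch of the creation step described above, for readers who want something concrete. It assumes an atom holding a map of name -> value stands in for the data store; `refp` is the proposed function, not an existing Clojure API, and `demo-store` is a made-up name.

```clojure
;; Hypothetical sketch: an atom of name -> value stands in for a real
;; data store (file, BDB, SQL, ...). `refp` is not part of Clojure.
(def demo-store (atom {"counter" 41}))

(defn refp
  "Create a ref named `ref-name` backed by `store`. If the store already
  holds a value under that name, start from it; otherwise use `default`."
  [store ref-name default]
  ;; Seed the store only when the name is not there yet.
  (swap! store #(if (contains? % ref-name) % (assoc % ref-name default)))
  (ref (get @store ref-name) :meta {:store store :name ref-name}))

(def counter (refp demo-store "counter" 0)) ; already in the store
(def fresh   (refp demo-store "fresh" 0))   ; not in the store yet
```

Uniqueness of the name within one store is still left to the caller, as in the proposal.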

In a dosync block, the function runs as normal until commit time.
Then Clojure acquires a transactional write lock on each refp that is
alter-ed or ensure-d. It checks the value against memory. If it's
the same, commit data store changes. If not, retry after refreshing
memory with the current contents of the store. If the data store
commit fails, retry a number of times. If the data store commit still
can't proceed, roll back the whole thing. commute and ref-set are
slightly different, but for an initial implementation, just treat
commute as an alter, and ignore ref-set.
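
The commit-time steps above can be sketched roughly as follows. This is not a change to Clojure's LockingTransaction; it only models the check, write, refresh and bounded retry, assuming an atom stands in for the data store and `locking` stands in for the store's write lock. All names here are hypothetical.

```clojure
;; Hypothetical sketch: `store` is an atom of name -> value standing in for
;; the data store; `mem` is an atom holding this process's in-memory value.
(defn commit-persisted!
  "Apply `f` to the in-memory value and write the result to the store, but
  only if the store still agrees with memory. On disagreement, refresh
  memory from the store and retry, giving up after `max-retries` retries."
  [store ref-name mem f max-retries]
  (loop [attempt 0]
    (if (> attempt max-retries)
      (throw (ex-info "store commit failed; rolling back"
                      {:name ref-name :attempts attempt}))
      (let [seen  @mem
            new-v (f seen)
            ;; `locking` models holding the store's write lock while we
            ;; compare the stored value with memory and then write.
            ok?   (locking store
                    (if (= seen (get @store ref-name))
                      (do (swap! store assoc ref-name new-v) true)
                      false))]
        (if ok?
          (do (reset! mem new-v) new-v)
          (do (reset! mem (get @store ref-name)) ; refresh from the store
              (recur (inc attempt))))))))

(def store (atom {"balance" 100}))
(def mem   (atom 100))
(def first-commit (commit-persisted! store "balance" mem #(+ % 10) 3))
;; Simulate another process changing the store behind our back:
(swap! store assoc "balance" 500)
(def second-commit (commit-persisted! store "balance" mem #(+ % 10) 3))
```

In the second call the first attempt sees a mismatch, refreshes memory, and succeeds on the retry.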

Does this make sense?

My intention is to cover the 80% case. The implementation would
necessarily be chatty, since the API is chatty. That's OK for most
systems.

This API has the benefit of being able to be shared across Clojure
instances. It's a nice bonus.

A dosync block may contain symbols pointing to refp's spanning
different data stores, which isn't too hard to handle. It simply
requires that if this is the case, each data store must support two-
phase commit or some other distributed transaction supporting
protocol. For an initial implementation, I would just throw an
exception.
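
As a rough illustration of the "just throw an exception" option, the check below rejects a set of refs that span more than one store. It assumes each persisted ref carries its store under a `:store` metadata key, which is an invented convention, and `check-single-store!` is a hypothetical helper.

```clojure
;; Hypothetical sketch: each persisted ref is assumed to carry its store
;; identifier in metadata under :store.
(defn check-single-store!
  "Return the one store shared by `refps`, or throw if they span several."
  [refps]
  (let [stores (set (map #(:store (meta %)) refps))]
    (when (> (count stores) 1)
      (throw (UnsupportedOperationException.
              "transaction spans multiple data stores")))
    (first stores)))

(def a (ref 0 :meta {:store "db-a"}))
(def b (ref 0 :meta {:store "db-a"}))
(def c (ref 0 :meta {:store "db-b"}))

(def same-store (check-single-store! [a b]))
(def mixed-rejected?
  (try (check-single-store! [a c]) false
       (catch UnsupportedOperationException _ true)))
```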

I've begun working on an implementation using BDB.

What do people think?

Linus Ericsson

Sep 2, 2010, 8:28:40 AM
to clo...@googlegroups.com
Persistent variable handling is one of the things I have spent a lot of time on as a beginner and former SQL-illiterate (along with finally getting swank to work; it's a dream!). I have, however, learned quite a bit about databases along the way, but that was not my main goal, and it has taken time away from "the real task".

I have looked into clj-record, which seems highly usable as well, but the persistent-refs idea feels a bit more elegant, IMHO.

An easy way to persist state would be highly valuable when developing web applications with compojure in :reload-mode, since ordinary refs seem to lose their state after each reload in the browser. It would be valuable in production environments as well, of course.

/Linus


Timothy Baldridge

Sep 2, 2010, 8:47:03 AM
to clo...@googlegroups.com
>It checks the value against memory. If it's
>the same, commit data store changes. If not, retry after refreshing
>memory with the current contents of the store.

May I suggest we take a page from the CouchDB book here? In addition
to having an "id", each ref also has a revision id. Let's say the id for
my ref is "foo". Then the first time I modify that ref the revid
becomes "1-foo". From there the 1- is incremented by one on each
writing transaction. This allows for a single string load & compare
for each change, instead of reading the entire contents of the ref.
That, combined with CouchDB's copy-on-write, means that you can kill a
CouchDB process mid-execution and then restart it without any data
corruption at all.
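
A small sketch of the revision check, following the numbering described in this post ("1-foo", "2-foo", ...) rather than CouchDB's actual revision format. The store is a plain map here, and `next-revid` and `write-if-current` are made-up names.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch: the store is a plain map of
;; id -> {:revid "N-id", :value ...}.
(defn next-revid
  "Revid for the next write: nil -> \"1-foo\", \"1-foo\" -> \"2-foo\"."
  [id revid]
  (let [n (if revid
            (Long/parseLong (first (str/split revid #"-" 2)))
            0)]
    (str (inc n) "-" id)))

(defn write-if-current
  "Write `value` under `id` only when `expected-revid` matches the stored
  revid. Returns the updated store, or nil when the caller is stale."
  [store id expected-revid value]
  (let [{:keys [revid]} (get store id)]
    (when (= revid expected-revid)
      (assoc store id {:revid (next-revid id revid) :value value}))))

(def s1 (write-if-current {} "foo" nil {:n 1}))     ; first write
(def s2 (write-if-current s1 "foo" "1-foo" {:n 2})) ; caller is up to date
(def s3 (write-if-current s2 "foo" "1-foo" {:n 3})) ; caller is stale
```

Only the short revid string is compared; the stored value itself is never read for the check.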

If we find a way to do external persistence...that would be awesome, I
can think of several use cases for it right now.

Timothy


--
“One of the main causes of the fall of the Roman Empire was
that–lacking zero–they had no way to indicate successful termination
of their C programs.”
(Robert Firth)

Mike Meyer

Sep 2, 2010, 12:56:54 PM
to clo...@googlegroups.com, alyssa...@gmail.com
On Wed, 1 Sep 2010 15:14:45 -0700 (PDT)
Alyssa Kwan <alyssa...@gmail.com> wrote:

> I'll go one step further and say that we shouldn't have to call
> "persist namespace". It should be automatic such that a change to the
> state of an identity is transactionally written.
>
> Let's start with refs. We can tackle the other identities later.
>
> The API is simple. Call (refp) instead of (ref) when creating a
> persisted ref. Passed into the call are a persistence address (file
> path, DB connection string, etc.) and a name that has to be unique to
> that persistence address. Not all refs end up being referred to by a
> top-level symbol in a package, and multi-process systems are hard...
> Ensuring uniqueness of name is up to the programmer. Upon creation,
> Clojure checks to see if the refp exists in the store; if so it
> instantiates in memory with that state, else it uses the default in
> the call.

First, I like the idea. But I think it's a bit clunky. Putting the
address information directly in the refp means you'll probably need it
in multiple places, as the most common usage is probably to store more
than one data item in each storage. This strikes me as a bad idea,
violating DRY.

How about introducing a second part to the api? (store) creates a
wrapper for the persistent address, and refp then takes one of those
wrappers and the name?
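
One possible shape for that two-part API, as a sketch only. Neither `store` nor `refp` exists in Clojure; the address is a made-up path, and an atom stands in for whatever actually lives at that address.

```clojure
;; Hypothetical API sketch; nothing here touches a real file or database.
(defn store
  "Wrap a persistence address once, so it is not repeated for every ref.
  The :data atom stands in for whatever lives at that address."
  [address]
  {:address address :data (atom {})})

(defn refp
  "Create a ref named `ref-name` inside the wrapped store `st`."
  [st ref-name default]
  (let [data (:data st)]
    (swap! data #(if (contains? % ref-name) % (assoc % ref-name default)))
    (ref (get @data ref-name) :meta {:store st :name ref-name})))

(def accounts-db (store "/tmp/accounts.db")) ; address stated in one place
(def alice (refp accounts-db "alice" 0))
(def bob   (refp accounts-db "bob" 0))
```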

<mike
--
Mike Meyer <m...@mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Alyssa Kwan

Sep 3, 2010, 5:14:14 PM
to Clojure
Revision number is a great idea!

I don't think I want to do copy-on-write within Clojure because it
would require a separate thread for cleanup. The underlying database
should take care of it anyways.

Thanks!
Alyssa

Alyssa Kwan

Sep 3, 2010, 5:11:43 PM
to Clojure
I totally agree. I was planning on doing what clojure.contrib.sql does with a
global var and with-connection block for dynamic binding. Unless
people don't like that approach...
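
For readers unfamiliar with the pattern, a dynamically bound store might look like the sketch below. `*store*`, `with-store` and `current-store` are invented names that only mirror the shape of a with-connection block; they are not part of clojure.contrib.sql.

```clojure
;; Hypothetical sketch of a global var rebound per block.
(def ^:dynamic *store* nil)

(defmacro with-store
  "Evaluate `body` with *store* bound to `st`."
  [st & body]
  `(binding [*store* ~st] ~@body))

(defn current-store
  "Return the store bound by the enclosing with-store, or throw."
  []
  (or *store*
      (throw (IllegalStateException. "no store bound; use with-store"))))

(def seen-address
  (with-store {:address "/tmp/demo.db"} ; made-up address
    (:address (current-store))))
```

The binding is thread-local and is undone when the block exits, so code outside `with-store` sees no store.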

On Sep 2, 12:56 pm, Mike Meyer <mwm-keyword-googlegroups.

nchubrich

Sep 4, 2010, 12:27:39 AM
to Clojure
> How about introducing a second part to the api? (store) creates a
> wrapper for the persistent address, and refp then takes one of those
> wrappers and the name?

I like that. I would go one step further and say refp should have a
default data store that is used unless you specify anything else via
'store' or additional arguments to refp. This would go partway
towards making "persistence like garbage collection".
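
A default store could be as simple as an extra arity, sketched here under the assumption that an atom of name -> value stands in for a data store. `default-store` and this `refp` are hypothetical.

```clojure
;; Hypothetical sketch; atoms stand in for real data stores.
(def default-store (atom {}))

(defn refp
  "With two arguments, use `default-store`; with three, the given store."
  ([ref-name default]
   (refp default-store ref-name default))
  ([st ref-name default]
   (swap! st #(if (contains? % ref-name) % (assoc % ref-name default)))
   (ref (get @st ref-name))))

(def other-store (atom {"visits" 7}))
(def hits   (refp "hits" 0))               ; lands in default-store
(def visits (refp other-store "visits" 0)) ; explicit store is used
```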

The reason I suggested persisting entire namespaces was that
sometimes (often) the need to persist data is outside of any usage of
refs (or atoms, or agents). Maybe such a high-level facility could be
built on top of refp.

Another thing I'd like to see (as I mentioned) is an extension of map
syntax for the kinds of (e.g. SQLish or XMLish) data structures you
get out of persistent stores. One of the brilliant things about
Clojure is its unification of multiple sequence types. It was
something that was crying out to be done. I think that, similarly,
there could be a more unified and standard way of dealing with maps,
maps-of-maps, maps containing lists, etc.

There is already a nice shorthand syntax for maps: to get a key's
value out of a map, you pass it the key:

({:a 1 :b 2} :a) -> 1

How about extending this system to selection among sets of maps (i.e.
relational data), by passing a key-value pair:

(#{{:a 1 :b 2}
   {:a 3 :c 4}} {:a 3}) ->

{:a 3 :c 4}

Which could then be further processed by just passing a key or set of
keys to "project":

(... :a) -> 3
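
In today's Clojure, calling a set with `{:a 3}` just looks up that exact map, so the selection above has to be spelled out as a function. A sketch using `clojure.set/select`, with `select-where` as a made-up helper name:

```clojure
(require '[clojure.set :as set])

(defn select-where
  "Return the maps in `rel` that contain every key-value pair in `pattern`."
  [rel pattern]
  (set/select (fn [row] (= pattern (select-keys row (keys pattern)))) rel))

(def rel #{{:a 1 :b 2}
           {:a 3 :c 4}})

(def matches   (select-where rel {:a 3})) ; still a set of maps
(def projected (map :a matches))          ; pull one key out of each match
```

The result stays a set because more than one row may match the pattern.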

Now consider nested maps with lists, which might come from XML or some
other kind of hierarchical data (note this is not what you would get
from read-xml; I'm trying to imagine a "native" structure that doesn't
refer to external concepts such as attributes, content, etc.):

{:catalog [{{:node :book :isbn 100}
            {:author :Shakespeare :genre :play}}
           {{:node :book :isbn 101}
            {:author :Shakespeare :genre :sonnet}}
           {{:node :book :isbn 102}
            {:author :Knuth :genre :textbook}}]}

You might pass this a sequence of terms "above" a given node to
identify it:

(data [:book :author :Shakespeare :sonnet])

Which yields the entire hierarchy above any identified nodes:

-> {:catalog [{{:node :book :isbn 101}
               {:author :Shakespeare :genre :sonnet}}]}

This can then be further processed by "projecting" the node level you
want to look at:

(... [:catalog :node :book]) -> {{:node :book :isbn 101}
                                 {:author :Shakespeare :genre :sonnet}}

(... [:catalog {:node :book} :author]) -> :Shakespeare

You could be as imprecise as you like in doing these "queries" -- all
the way from sets of unordered terms (using set notation) to specific
key-value pairs (e.g. in the last example). The idea is to make the
best of your (sometimes partial) knowledge.
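
The loosest form of such a query, an unordered set of terms, can be approximated with existing functions. This sketch is only one reading of the idea: it walks each catalog entry with `tree-seq` and keeps the entries that mention every term. `terms-of` and `find-entries` are invented names.

```clojure
(def data
  {:catalog [{{:node :book :isbn 100} {:author :Shakespeare :genre :play}}
             {{:node :book :isbn 101} {:author :Shakespeare :genre :sonnet}}
             {{:node :book :isbn 102} {:author :Knuth :genre :textbook}}]})

(defn terms-of
  "Every non-collection value (keys and leaves) found anywhere inside `x`."
  [x]
  (set (remove coll? (tree-seq coll? seq x))))

(defn find-entries
  "Catalog entries that mention every term in `terms`, in any position."
  [data terms]
  (filter (fn [entry] (every? (terms-of entry) terms)) (:catalog data)))

(def found (find-entries data #{:book :Shakespeare :sonnet}))
```

This ignores where in the hierarchy a term appears, which is exactly the imprecision being traded for convenience.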

This is a very rough idea, and it completely ignores some edge
cases, but the basic point is that I am trying to query "relational"
data and hierarchical data in the same way. A further way of
"unifying" external data sources is to converge on a standard way of
dealing with relational or hierarchical data, and meanwhile allowing
the same methods of querying to be used on different representations.

For instance, maps themselves can be represented as maps, as a list of
alternating key-value pairs, a vector of two-place vectors, or a list
of keys and a list of values. We should be able to query all these
things in the same way, and there should be utility functions to
convert between them (I always end up writing my own).
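
For reference, core functions already cover most of these conversions; the var names below are just for illustration.

```clojure
(def as-map {:a 1 :b 2})

;; Into a map from each alternative representation:
(def from-flat  (apply hash-map [:a 1 :b 2])) ; alternating keys and values
(def from-pairs (into {} [[:a 1] [:b 2]]))    ; vector of two-place vectors
(def from-cols  (zipmap [:a :b] [1 2]))       ; list of keys, list of values

;; And back out again:
(def to-pairs (vec (map vec as-map)))
(def to-flat  (vec (mapcat identity as-map)))
(def to-cols  [(vec (keys as-map)) (vec (vals as-map))])
```

What is missing is a single query interface over all four shapes, which is the point being made above.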

In the case of relational data, different programs pick different
representations. There is Clojure's native way, which is sets of
maps. Incanter has a two-item map with :column-names and :rows.
ClojureQL does something else again. When I have relational-like data
in my own programs, I tend to pick one column as the "key", and map it
to subsidiary data, e.g.:

{:screw {:price 0.1 :supplier :Acme}
 :bolt  {:price 0.2 :supplier :Nuts}}

It should be possible to query any reasonable representation in the
same way, and writers of libraries meanwhile should try to converge on
a single way of representing these things (why not Clojure's?).
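
Moving between the set-of-maps form and the keyed form above is mechanical, as this sketch shows. The `:part` column name and the helpers `key-by` and `unkey` are assumptions made for the example.

```clojure
;; Set-of-maps form; the :part column is invented for this example.
(def parts #{{:part :screw :price 0.1 :supplier :Acme}
             {:part :bolt  :price 0.2 :supplier :Nuts}})

(defn key-by
  "Turn a set of maps into a map from each row's `k` value to the rest
  of that row."
  [k rel]
  (into {} (map (fn [row] [(get row k) (dissoc row k)]) rel)))

(defn unkey
  "Inverse of key-by: fold the key back into each row under `k`."
  [k keyed]
  (set (map (fn [[v row]] (assoc row k v)) keyed)))

(def keyed (key-by :part parts))
```

`key-by` silently keeps only one row per key value, so it suits columns that really are unique.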

Constantine Vetoshev

Sep 5, 2010, 11:02:23 AM
to Clojure
On Aug 30, 5:02 pm, nchubrich <nchubr...@gmail.com> wrote:
> Persistence libraries always end up warping the entire codebase; I've
> never succeeded in keeping them at bay.  Using data with Incanter is
> different from ClojureQL, which is different from just using
> contrib.sql, and all of it is different from dealing with just
> Clojure.  (I've never even tried Clojure + Hibernate.)  You might as
> well rewrite the program from scratch depending on what you use.
> Maybe other people have had better luck; but whatever luck they have,
> I'm sure it is a fight to keep programs abstracted from persistence.

I feel the same way. Late last year, I wrote a small BDB-based library
for adding disk persistence to Clojure datatypes. It does require
knowing in advance which types you want to write to disk. I haven't
had time to work on it recently, but I'd like to pick it up again
sometime soon. (And I welcome help — the wonders of open-source
software and all that.)

http://github.com/gcv/cupboard

It has a few outstanding problems.

1. It needs to be updated for deftype and defrecord in Clojure 1.2.
2. Deadlock detection doesn't seem to work 100% of the time, and I
haven't yet tracked down the reasons. Parts of the test suite
currently fail as a result.
3. Reads are slow. Casual profiling hasn't revealed the reasons, so
it's a bit tricky to track down.

Alyssa Kwan

Sep 4, 2010, 9:34:08 PM
to Clojure
OK... This question is probably better directed at clojure-dev, but my
membership is still pending. I'm having trouble interpreting
LockingTransaction.run. Where exactly are read-locks for ensures set?
And what precisely is in the commutes map and sets set? Why does
membership in sets short-circuit the commutes loop, and what if there
is a read lock on a member of the sets set?

Everything is done with the exception of the actual compare and set
operation with the BDB database. In other words, the real work. :)

Thanks!
Alyssa

Alyssa Kwan

Sep 5, 2010, 9:20:34 PM
to Clojure
> > How about introducing a second part to the api? (store) creates a
> > wrapper for the persistent address, and refp then takes one of those
> > wrappers and the name?
>
> I like that.  I would go one step further and say refp should have a
> default data store that is used unless you specify anything else via
> 'store' or additional arguments to refp.  This would go partway
> towards making "persistence like garbage collection".

Great suggestion!

> The reason I suggested persisting entire namespaces was that
> sometimes (often) the need to persist data is outside of any usage of
> refs (or atoms, or agents).  Maybe such a high-level facility could be
> built on top of refp.

I disagree. There are three kinds of identities - those that mutate in a
multi-process environment, those that mutate in a guaranteed single-
process environment, and those that don't mutate.

If you can make outside changes to the underlying data store at
runtime (DB or even properties file), through either another JVM or
manually, then it's inherently multi-process. Multi-process mutation
of identities belongs in refs et al.

The second case, mutation in a single process, is synonymous with vars
in Java/Clojure, where logical processes are VM threads. The problem
is that vars are thread-local: how in the heck do you name the
thread-specific var for persistence? And how do you read it back,
given that the number and nature of threads will be different next
time? How do threads in a future VM map to threads in a previous one?
Threads are inherently volatile. I see the use case of long-running
processes having a clear identity and needing to persist/recover
state, but it's a much smaller use case than the first (multi-process
mutation).

Identities that don't mutate are like defonce, where the db state
being read is locked for the duration of the VM. The only real use
case is a properties database that can only be edited when the VM is
shut down.

I guess what I don't see is why you would need to OFTEN persist data
outside of refs... The second case, though legitimate, seems minor,
and can be easily modeled with the first. The third case (the
properties database) is very common, but should easily be addressed by
libs (though I agree it should be part of the language). I guess what
I'm saying is that we should probably be using refs if persistence is
required.

> Another thing I'd like to see (as I mentioned) is an extension of map
> syntax for the kinds of (e.g. SQLish or XMLish) data structures you
> get out of persistent stores.  One of the brilliant things about
> Clojure is its unification of multiple sequence types.  It was
> something that was crying out to be done.  I think that, similarly,
> there could be a more unified and standard way of dealing with maps,
> maps-of-maps, maps containing lists, etc.

There must be someone on this list who can develop a formally correct
regular expression language for hierarchical data. The formal
correctness part is not my strong suit. I'm not too familiar with XSL,
but would a less verbose version of that work?

Thanks!
Alyssa

Alyssa Kwan

Sep 5, 2010, 8:56:05 PM
to Clojure
Thanks, Constantine! Your work on cupboard is awesome! I'll take a
look at the deadlock detection to see if I can help.

Any thoughts on how to marshal functions? What about vars and dynamic
binding?

Thanks!
Alyssa

Constantine Vetoshev

Sep 7, 2010, 10:25:26 AM
to Clojure
On Sep 5, 8:56 pm, Alyssa Kwan <alyssa.c.k...@gmail.com> wrote:
> Any thoughts on how to marshal functions? What about vars and dynamic
> binding?

I don't think marshaling closures will ever happen without changes to
Clojure itself. I haven't looked into how much work it would require,
or how much it would impact Clojure's performance. It always seemed
like an excessively lofty goal anyway: if I could save plain Clojure
data structures (all primitives, all fully-evaluated collections, and
all records), I would be happy with it. In truth, I always wanted to
extend Cupboard to support some kind of semi-magical distributed
storage (like Oracle Coherence, but with better persistence guarantees
— database-like rather than cache-like), but wanted to get single-node
basics working properly first. The latest BDB JE has some replication
built-in, and I planned to use it.

As for dynamic binding, I'm not sure what you mean. The bound value
will evaluate using Clojure's normal rules when cupboard.core/make-
instance runs, and go into the database. cupboard.core/query will then
read it and make the value part of the returned map (it should really
be a Clojure 1.2 record). The code doesn't do anything except save and
restore fully-evaluated data structures.
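
The distinction between fully-evaluated data and functions can be seen without any database. The sketch below does not use Cupboard's API; it only round-trips values through `pr-str` and `clojure.edn/read-string` as a stand-in for writing to and reading from disk.

```clojure
(require '[clojure.edn :as edn])

;; Plain, fully-evaluated data survives a print/read round trip.
(def data {:title "Hamlet" :tags #{:play :tragedy} :pages [1 2 3]})
(def round-tripped (edn/read-string (pr-str data)))

;; A function prints as an opaque object reference, which the EDN reader
;; has no way to turn back into a function.
(def fn-readable?
  (try (edn/read-string (pr-str inc)) true
       (catch Exception _ false)))
```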

Incidentally, Cupboard wraps BDB transactions, and does not attempt to
work with Clojure's STM subsystem. I always considered this a
weakness, but a difficult one to resolve. To counterbalance it, I
planned to avoid mixing STM and on-disk data structures in the same
code.