Clojars's database


Phil Hagelberg

Jan 5, 2013, 5:39:36 PM
to clojars-m...@googlegroups.com

I've been working on and off on the releases repository for Clojars, but
it's been slow going as I keep bumping into new problems with the SQLite
JDBC drivers. I apologize for my poor communication here. I have hit a
bit of a dead end where I don't think my current promotion plan will
work without replacing the database.

While considering database alternatives I came upon a promising idea.
Rather than being a CRUD application, Clojars could be restructured as
storing a stream of events and querying an index of that event stream.

I believe the read-only portion of Clojars could be implemented
primarily by reading from the repository on the filesystem, though
it would need to be augmented by these queries:

* User authenticated?
* List all members of a group
* Search
* Recently pushed jars

The write events that Clojars deals with are these:

* Profile updates (registration is a special case of this?)
* Membership grants
* Deploys

If each of these events results in adding a line to an append-only
event log file, we could construct Lucene indices from those logs
which could be used to satisfy the required queries. As events happened
they would also update the index in real time, but if a bug was
discovered it would be easy to rebuild the indices. This feels like a
much more functional approach in line with the spirit of Clojure.
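
To make the idea concrete, here is a minimal sketch of an append-only log with a rebuildable index. It is written in Python only for illustration. The event fields (`type`, `group`, `jar`, `version`), the file name, and every function name are assumptions, and a plain dict stands in for the Lucene index.

```python
import json
import os
import tempfile


def append_event(log_path, event):
    """Append one event as a single JSON line; earlier lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(log_path):
    """Yield events in the order they were written."""
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            if line.strip():
                yield json.loads(line)


def apply_event(index, event):
    """Fold one event into the index (a dict standing in for Lucene)."""
    if event["type"] == "deploy":
        key = (event["group"], event["jar"])
        index.setdefault(key, []).append(event["version"])
    return index


def rebuild_index(log_path):
    """Discard nothing from the log; derive a fresh index by replaying it."""
    index = {}
    for event in read_events(log_path):
        apply_event(index, event)
    return index


def record(log_path, index, event):
    """Write path: append to the log first, then update the live index."""
    append_event(log_path, event)
    apply_event(index, event)


# Demo: the live index and an index rebuilt from the log should agree.
with tempfile.TemporaryDirectory() as directory:
    log_path = os.path.join(directory, "events.log")
    live_index = {}
    for version in ("0.1.0", "0.2.0"):
        record(log_path, live_index,
               {"type": "deploy", "group": "org.example",
                "jar": "demo", "version": version})
    rebuilt_index = rebuild_index(log_path)
```

Because the index is derived purely from the log, fixing a bug in `apply_event` and calling `rebuild_index` again would be enough to repair it.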

Eventually we'd need to support rolling and compressing old event logs
and possibly collapsing them down to only the relevant events, but I
don't believe there is anything particularly technically challenging about
that.
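
As a rough sketch of what rolling and collapsing might look like, again in Python for illustration: it assumes each profile update carries the full profile, so a later update for the same user supersedes earlier ones. The event shapes, the `.1.gz` suffix, and the function names are all hypothetical, and the roll step is not crash-safe.

```python
import gzip
import os
import shutil
import tempfile


def compact(events):
    """Drop profile updates that a later update for the same user supersedes."""
    last_profile = {}
    for position, event in enumerate(events):
        if event["type"] == "profile-update":
            last_profile[event["user"]] = position
    return [
        event
        for position, event in enumerate(events)
        if event["type"] != "profile-update"
        or last_profile[event["user"]] == position
    ]


def roll_and_compress(log_path):
    """Copy the active log into a gzip archive, then truncate the active log."""
    archive_path = log_path + ".1.gz"
    with open(log_path, "rb") as source, gzip.open(archive_path, "wb") as target:
        shutil.copyfileobj(source, target)
    open(log_path, "w").close()  # leave an empty active log behind
    return archive_path


# Demo: compaction keeps deploys and grants, and only the newest profile.
events = [
    {"type": "profile-update", "user": "alice", "profile": {"email": "old"}},
    {"type": "deploy", "group": "org.example", "jar": "demo", "version": "0.1.0"},
    {"type": "profile-update", "user": "alice", "profile": {"email": "new"}},
    {"type": "membership-grant", "group": "org.example", "user": "alice"},
]
compacted = compact(events)

# Demo: rolling preserves the bytes of the old log inside the archive.
with tempfile.TemporaryDirectory() as directory:
    log_path = os.path.join(directory, "events.log")
    with open(log_path, "wb") as log:
        log.write(b"line one\nline two\n")
    archive_path = roll_and_compress(log_path)
    with gzip.open(archive_path, "rb") as archive:
        archived_bytes = archive.read()
    active_size = os.path.getsize(log_path)
```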

I welcome your thoughts on this matter.

-Phil

Phil Hagelberg

Jan 7, 2013, 7:53:19 PM
to clojars-m...@googlegroups.com

Phil Hagelberg <ph...@hagelb.org> writes:

> If each of these events results in adding a line to an append-only
> event log file, we could construct Lucene indices from those logs
> which could be used to satisfy the required queries. As events happened
> they would also update the index in real time, but if a bug was
> discovered it would be easy to rebuild the indices. This feels like a
> much more functional approach in line with the spirit of Clojure.

After some chat on IRC it was pointed out that the users/groups
data set is tiny and could easily fit in memory far into the foreseeable
future. I'm going to spike out an implementation that simply builds maps
in two atoms in memory and only uses indices for searching over
artifacts.
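
A minimal sketch of that shape, in Python for illustration, with two plain dicts standing in for the two atoms. The event types and field names are assumptions, not the actual spike.

```python
class Registry:
    """Users and groups held entirely in memory, rebuilt by replaying events."""

    def __init__(self):
        self.users = {}   # username -> latest profile
        self.groups = {}  # group name -> set of member usernames

    def apply_event(self, event):
        """Update the in-memory maps for one event; unrelated events are ignored."""
        if event["type"] == "profile-update":
            self.users[event["user"]] = event["profile"]
        elif event["type"] == "membership-grant":
            self.groups.setdefault(event["group"], set()).add(event["user"])

    def replay(self, events):
        """Rebuild state from scratch, for example at startup."""
        for event in events:
            self.apply_event(event)
        return self

    def members(self, group):
        """Answer the 'list all members of a group' query."""
        return sorted(self.groups.get(group, ()))


# Demo: deploy events pass through untouched; only users/groups are tracked.
registry = Registry().replay([
    {"type": "profile-update", "user": "alice", "profile": {"email": "a@example.org"}},
    {"type": "membership-grant", "group": "org.example", "user": "alice"},
    {"type": "membership-grant", "group": "org.example", "user": "bob"},
    {"type": "deploy", "group": "org.example", "jar": "demo", "version": "0.1.0"},
])
```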

-Phil

Alex Osborne

Jan 9, 2013, 5:18:18 AM
to clojars-m...@googlegroups.com

Sounds good. I like the idea of getting rid of the database entirely.

Toby Crawley

Feb 4, 2013, 10:32:12 AM
to clojars-m...@googlegroups.com

Phil Hagelberg writes:

> If each of these events results in adding a line to an append-only
> event log file, we could construct Lucene indices from those logs
> which could be used to satisfy the required queries. As events happened
> they would also update the index in real time, but if a bug was
> discovered it would be easy to rebuild the indices. This feels like a
> much more functional approach in line with the spirit of Clojure.

I really like the event log approach. I do have a question about how
it works in the current production environment - I see that the
clojars architecture diagram shows a second instance for failover and
rolling deploys. Is that instance still used? If so, how do you handle
a file-based event log between two instances? If they have access to
the same filesystem, is there a chance of both instances writing to
the log at the same time? The code currently handles locking by using
the locking macro on 'record. But that lock will only work within one
instance, correct?


--
Toby Crawley
http://immutant.org | http://torquebox.org

Phil Hagelberg

Feb 4, 2013, 8:48:58 PM
to clojars-m...@googlegroups.com

Toby Crawley writes:
> I really like the event log approach. I do have a question about how
> it works in the current production environment - I see that the
> clojars architecture diagram shows a second instance for failover and
> rolling deploys. Is that instance still used? If so, how do you handle
> a file-based event log between two instances?

Good point. Right now I believe that the backup instance is only used
during deployment to check that we're all-clear before bouncing the
primary instance. I don't think that we actually have production traffic
falling back to it, but we should double check the logs to confirm this.

Even so, it would probably be wise to either use file locking on the
event logs or to configure the backup instance to write to a different
directory and come up with a strategy for reconciling the two. Since
everything is timestamped, it should be easy to merge the two streams
after the fact. But we should consider that in our plans.
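
The merge could be sketched roughly as follows, in Python for illustration. It assumes each instance's log is already in timestamp order (true for an append-only log if the clock never goes backwards) and that events carry a numeric `timestamp` field; both the field name and the `source` tag are hypothetical.

```python
import heapq


def merge_streams(primary, backup):
    """Merge two timestamp-ordered event streams into one ordered list.

    heapq.merge is lazy and stable across its inputs, so when two events
    share a timestamp the one from the primary stream comes first.
    """
    return list(heapq.merge(primary, backup, key=lambda event: event["timestamp"]))


# Demo: interleave the two logs; the tie at timestamp 4 favors the primary.
primary_events = [
    {"timestamp": 1, "source": "primary"},
    {"timestamp": 4, "source": "primary"},
]
backup_events = [
    {"timestamp": 2, "source": "backup"},
    {"timestamp": 4, "source": "backup"},
]
merged = merge_streams(primary_events, backup_events)
```

This only orders the events. Conflicting events, such as the same version deployed through both instances, or clock skew between the two machines, would still need a policy of their own.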

Thanks for bringing this up.

-Phil