We Need Distributed Revision/Version Control for Data


Rufus Pollock

Jul 13, 2010, 8:02:28 AM
to get-t...@googlegroups.com
Yesterday I wrote a post on distributed revision/version control for data:

<http://blog.okfn.org/2010/07/12/we-need-distributed-revisionversion-control-for-data/>

I'd be very interested to hear any thoughts people have, or any useful
pointers to existing technology.

Regards,

Rufus

<excerpt>
In the open data community, we need tools for doing distributed
revision/version control for data like the ones that already exist
for code.

(Don’t know what I mean by revision control or distributed revision
control? Read this)

Distributed revision control systems for code, like mercurial and git,
have had a massive impact on software development, and especially so
in the F/OSS community — the distributed methodology works
particularly well with open material.

The same would be true for data. Revision control, and specifically
distributed revision control, would support (cf this and this earlier
post):

* Incremental development: “patches”, changelogs etc
* Provenance tracking: showing who did what, and when, is built into a
revisioning system
* Broader participation: you don’t have to worry (as much) about who
you let in because changes can be reverted. It’s also easier to get
involved because you can have your own independent copy to play around
with (Distributed).
* Easier collaboration: updates don’t mean making a full copy (and
applying updates is automatic), and you can see who is making changes,
when, etc.
* Peer-2-peer model: different contributors can work simultaneously
and independently (Distributed). Extra “features” can be added
independently of mainline development with re-integration later
(Distributed).
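
To make the "patches" idea in the list above a little more concrete,
here is a minimal Python sketch of a row-level diff and patch for a
small keyed table. It is only an illustration under stated
assumptions: the function names diff_rows and apply_patch, the patch
layout, and the toy rows are all made up for this example, and it is
not how any existing tool works.

```python
def diff_rows(old, new):
    """Build a patch describing how to turn the table `old` into `new`.

    Both tables are dicts mapping a primary key to a row (itself a dict).
    The patch layout (added / removed / changed) is an assumption made
    for this sketch, not an established format.
    """
    patch = {"added": {}, "removed": [], "changed": {}}
    for key, row in new.items():
        if key not in old:
            patch["added"][key] = row
        elif old[key] != row:
            patch["changed"][key] = row
    for key in old:
        if key not in new:
            patch["removed"].append(key)
    return patch


def apply_patch(table, patch):
    """Return a new table with the patch applied; the input is not modified."""
    result = dict(table)
    for key in patch["removed"]:
        result.pop(key, None)
    result.update(patch["added"])
    result.update(patch["changed"])
    return result


# Toy data with made-up values, keyed by an arbitrary id.
v1 = {"a": {"name": "Alpha", "value": 1}, "b": {"name": "Beta", "value": 2}}
v2 = {"a": {"name": "Alpha", "value": 10}, "c": {"name": "Gamma", "value": 3}}

patch = diff_rows(v1, v2)
print(patch)
print(apply_patch(v1, patch) == v2)
```

The patch only carries the rows that differ, so sharing an update does
not require shipping a full copy of the dataset. Merging concurrent
patches from independent contributors is the hard part and is not
attempted here.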

Because this is all a bit abstract it is worth giving a concrete
example of why “distributed” revision control could be so useful.

...
</excerpt>

--
Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/

Sid

Jul 13, 2010, 5:09:58 PM
to get.theinfo

This is a very important topic. I've been trying to look up a solution
but haven't found anything worthwhile, so I would love to hear other
people's input on this one too.

Most of my data sets have been column-based (more relational), so my
current solution reflects that. I've been storing the data in a MySQL
database and using the Django framework's built-in ORM. There are
migration tools, such as South for Django, that provide both data and
schema migrations. So I version-control the migration scripts with git,
not the actual data. I don't know if that solution works for all
datasets, or with more than 2-3 people working on it.
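
A rough sketch of the "version the migration scripts, not the data"
approach described above, in Python. To keep it self-contained it uses
the standard-library sqlite3 module as a stand-in for MySQL and
Django/South, so this is not South's API. The migration names, the
applied_migrations bookkeeping table, and the sample row are all
hypothetical.

```python
import sqlite3

# Hypothetical migrations. In practice each one would be a separate
# script kept under git; the ordered list is what gets versioned.
MIGRATIONS = [
    ("0001_create_city",
     "CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT)"),
    ("0002_add_population",
     "ALTER TABLE city ADD COLUMN population INTEGER"),
    ("0003_load_sample_row",
     "INSERT INTO city (name, population) VALUES ('Exampleville', 1000)"),
]


def migrate(conn):
    """Apply, in order, every migration not yet recorded as applied.

    Returns the names of the migrations applied by this call.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}
    newly_applied = []
    for name, sql in MIGRATIONS:
        if name in applied:
            continue
        # Commit the change together with the record that it was applied.
        with conn:
            conn.execute(sql)
            conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))
        newly_applied.append(name)
    return newly_applied


conn = sqlite3.connect(":memory:")
print(migrate(conn))  # first run applies everything
print(migrate(conn))  # second run finds nothing left to apply
```

Anyone who checks out the scripts can rebuild the same database state
by replaying them, which is what makes the scripts a reasonable thing
to put under git. It assumes a single linear history of migrations,
which may be where it gets harder with more contributors.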


-Sid
http://sidmitra.com | twitter.com/sidmitra

Kumar McMillan

Jul 13, 2010, 5:39:43 PM
to get-t...@googlegroups.com
Google Wave was created to solve this kind of problem with the added
bonus of "it must happen in realtime." However, the first thing you
might ask yourself is do I really need to solve the problem of
distributed editable documents? REALLY? It introduces a lot of
complexity and if you are in fact just trying to make something like a
distributed wiki then there is probably an easier way.

Disclaimers aside, you can read all the gory details about how Google
Wave makes it possible to edit documents in many places at once:
http://wave-protocol.googlecode.com/hg/spec/conversation/convspec.html
From: http://www.waveprotocol.org/draft-protocol-specs

"A wave comprises a set of concurrently editable structured documents
and supports real-time sharing between multiple participants."

_

Additionally, you might find this talk interesting:
http://us.pycon.org/2010/conference/schedule/event/72/

> --
> [from the http://groups.google.com/group/get-theinfo mailing list]
