a thought experiment: datasets as git repos


Philip Durbin

Mar 8, 2014, 2:14:35 PM
to dataverse...@googlegroups.com

Tom Roche

Mar 13, 2014, 5:51:35 PM
to dataverse...@googlegroups.com

https://groups.google.com/d/msg/dataverse-community/5zJrr03R9ZE/6ahp8ZgQwt8J
[Philip Durbin Sat, 8 Mar 2014 14:14:35 -0500]
> https://docs.google.com/document/d/18WDIS8hrFJvMJBcnRuQ8NfD-VxGq32vJ9WwlEgyyWZs/edit?usp=sharing

Since I apparently can't comment on the gdoc directly, I thought I'd add my comments here:

> What about notes? What about comments?

Some groups use their web repo's Issues section for this purpose, e.g.

https://github.com/swcarpentry/bc/issues/375

Regarding the use of "web repo" where the doc uses (repeatedly) "GitHub":

While I'm sure Tom Preston-Werner and Andreessen Horowitz would like one to believe otherwise, git != GitHub :-) and there are several providers of remotely accessible git repositories with ancillary web-accessible services (issues, wikis, etc.) that are roughly equivalent. See, e.g., the Git column in

http://en.wikipedia.org/wiki/Comparison_of_open-source_software_hosting_facilities#Available_version_control_systems

The choice of web repo should depend on several factors, including

* availability of in-project binary storage (i.e., Downloads)

* need to switch between public and private access (and willingness to pay for it)

* wiki syntax support

* desire for coolth (the one area where GitHub completely dominates :-)

FWIW, Tom Roche <Tom_...@pobox.com>

Philip Durbin

Mar 13, 2014, 8:10:29 PM
to dataverse...@googlegroups.com
Hi Tom! Sorry, this was buried in the IRC log but everyone who would
like to comment on the Google Doc is welcome to send me their Gmail
address so I can give them permission to comment.

Of course, it's also fine to discuss right here on this mailing list.
That's what it's for. :)

You're absolutely right that an issue tracker per repo is an excellent
way to capture notes and comments. Unlike git itself, issue tracking is
not standardized. Every issue tracker works a little differently, and
it makes sense for dataset owners to choose their favorite. Or they can
have a mailing list or receive personal email or whatever.

I used GitHub as a point of comparison throughout the document because
it's well known.

Oh, I just read this paper and it's good and on topic for this thread:
Git can facilitate greater reproducibility and increased transparency
in science - http://bitss.org/2014/03/12/git-reproducibility-transparency/

The author even talks about using git for datasets.

Phil

p.s. Some more chatter today about datasets as git repos:
http://irclog.iq.harvard.edu/dvn/2014-03-13

Philip Durbin

Mar 13, 2014, 10:08:21 PM
to dataverse...@googlegroups.com
Ok, I opened an issue on his paper:

datasets as git repos: just files or also cataloging
information/metadata? - https://github.com/karthik/smb_git/issues/22

:)

Here's the HTML version by the way: http://www.scfbm.org/content/8/1/7

Phil

p.s. Tom, I just added a comment to the top of the Google Doc
indicating how people can contact me if they'd like to comment on the
doc.

Tom Roche

Mar 15, 2014, 12:35:07 AM
to dataverse...@googlegroups.com

Thanks, Philip, for opening this issue, and for pointing me at https://github.com/karthik/smb_git (except that it's dragging me away from some pressing gruntwork :-)

https://github.com/karthik/smb_git/issues/22#issue-29405240
>> For the case of storing datasets in git, have you thought much about
>> cataloging information or metadata about each dataset? [...]
>> What would your METADATA.json file look like?

https://github.com/karthik/smb_git/issues/22#issuecomment-37716764
> how 'bout facilitating connecting a git repository to a (separate, freestanding) data repository,
> so as to co-version (in the git repo) the data repo's metadata with the git repo's code [using git-annex]?

In case, like me, you've never heard of `git-annex` ... check it out! (see previous link) And thanks to Karthik Ram for the pointer.

> this would solve [Philip's] problem [above:] you're using someone else's data repo(s), you store their metadata in their format.

Seems so simple it couldn't possibly work. What am I missing? Comment on the GH issue!

TIA, Tom Roche <Tom_...@pobox.com>