Best practices for archiving data associated with code based in GitHub

Michael Steeleworthy

Sep 24, 2021, 11:11:56 AM
to Canadian Data Curation Network
Hi all and apologies for cross-posting.

I'm looking for a best-practices primer for curating and potentially archiving a dataset associated with code that is frequently updated in GitHub.

In short: what commit(s) do you point to in your record when the depositor has used many iterations of the code? Will the most recent commit suffice?

The Situation:
  • A grad student's dataset, being deposited into our repository alongside the thesis, is associated with code in a GitHub repo that he does not control.   
  • This code is frequently updated and is critical to the work of a large research group whose work stands to continue beyond its current funding, so the repository probably isn't going anywhere. The student isn't formally affiliated with the research group and has no real say over the code.
  • The student has used this code for a number of years.  
  • Multiple commits have been made over time; they seem to improve but never alter the code's core functionality. The research group has made no formal releases.
  • While unlikely, it is possible that the research group could delete the GitHub repository and all its commits. For this reason, we are planning to point to commits archived by Software Heritage rather than GitHub (see the sketch below).
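
For anyone unfamiliar with the mechanics, here is a minimal sketch of how a specific commit can be archived and then cited persistently, using Software Heritage's public "Save Code Now" API and a SWHID identifier. The repository URL and commit hash are hypothetical placeholders, not the student's actual repo.

```python
# Minimal sketch: trigger archiving of a Git repo in Software Heritage
# and build a persistent SWHID for one commit. The repository URL and
# commit hash below are hypothetical placeholders.
import requests

ORIGIN_URL = "https://github.com/example-group/analysis-code"  # hypothetical
COMMIT_SHA = "0123456789abcdef0123456789abcdef01234567"        # hypothetical

# Ask Software Heritage to (re)archive the repository ("Save Code Now").
save_url = (
    "https://archive.softwareheritage.org/api/1/origin/save/git/url/"
    f"{ORIGIN_URL}/"
)
resp = requests.post(save_url)
resp.raise_for_status()
print(resp.json()["save_request_status"])  # e.g. "accepted"

# Once the archival visit completes, the commit can be cited with a
# SWHID that survives deletion of the GitHub repository:
swhid = f"swh:1:rev:{COMMIT_SHA};origin={ORIGIN_URL}"
print(swhid)
```

The ";origin=" qualifier records where the commit came from, which is useful context in the metadata even if the GitHub origin later disappears.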

Our question: which commits should we point to, and what is the best practice here? The student knows the full range of commits he's used, and functionally the code (and its effect on the student's outputs) is the same across that range.

I am leaning toward pointing only to the last commit the student used as the benchmark in the metadata and the thesis. I'm happy to get your input and advice, though.

Thanks,
Michael at Wilfrid Laurier University.
