Hello everyone,
I am the lead developer of Labstory (www.labstory.se), a documentation/database system. I just heard of the OSF and will certainly be following your progress! We want to achieve the same thing but have taken a different path: we are trying to find a commercial model to fund our development, but, coming from open source, we are giving most of it away for free (all documents have globally unique IDs and can easily be shared, the server supports anonymous login, etc.). I don't want to turn this into an ad, though; the particular topic I would like to address is persistent IDs (PIDs) and documentation.
The problem we are trying to solve is how to document the use of big datasets, such as FASTQ files. Commonly only a file path to the data is stored, but this is not enough; files tend to move around a lot. Persistent IDs address this: there is DOI, but also several other initiatives with different benefits (a big one being that many of them are free to use). However, even a PID is not good enough on its own, although it is a lot better than a path. In addition we need to store fingerprints of the files, which are then signed cryptographically together with the rest of the document.
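To make the fingerprint idea concrete, here is a minimal sketch in Python. It assumes SHA-256 as the hash and uses made-up function names (file_sha256, verify_file); it is not Labstory's actual code, and the signing step is left out.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file.

    The file is read in chunks so that large files (e.g. FASTQ)
    never have to fit in memory. SHA-256 is an assumption here;
    any agreed-upon hash would work the same way.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Check a file against a fingerprint stored earlier."""
    return file_sha256(path) == expected_hex.lower()
```

The stored digest stays valid when the file is moved or renamed, which is exactly what a bare file path cannot offer.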
We are pushing for two things. The first is increased use of persistent IDs such as DOIs, assigned already when the data is generated, not 4 years later with the publication. The second is a standardization of how files are hashed, so that we can store the hash and verify it later. In addition we are working on file-format-independent hash methods, which will allow us to phase out awful formats such as FASTQ in favour of better compressed ones. We are also looking into file-format-independent hashes for microscopy images.
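As a rough illustration of what a hash that ignores the container could look like, the sketch below hashes the FASTQ records themselves rather than the bytes on disk, so a plain file and a gzip-compressed copy give the same digest. Everything in it is an assumption made for the example (strict 4-line records, SHA-256, length-prefixed fields, the function names); it is not the method we are standardizing.

```python
import gzip
import hashlib


def open_maybe_gzip(path):
    """Open a file for binary reading, decompressing if it is gzip."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rb")
    return open(path, "rb")


def fastq_content_hash(path):
    """Hash the header, sequence and quality of every record.

    Assumes strict 4-line FASTQ records. Line endings and the '+'
    separator line are ignored, so the digest depends only on the
    record content, not on compression or newline style.
    """
    digest = hashlib.sha256()
    with open_maybe_gzip(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip blank lines between or after records
            sequence = handle.readline()
            handle.readline()  # '+' separator line, not hashed
            quality = handle.readline()
            for field in (header, sequence, quality):
                value = field.rstrip(b"\r\n")
                # Length prefix keeps field boundaries unambiguous.
                digest.update(len(value).to_bytes(8, "big"))
                digest.update(value)
    return digest.hexdigest()
```

The same idea would carry over to other containers: decode to a canonical record stream first, then hash that stream.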
The format we are trying to standardize on is currently called FILEID. For a file x.fastq, you create x.fastq.fileid, containing DOIs, other PIDs and hashes. It also contains all author information. The file is ~4 kB, is easy to integrate in pipelines, and does not require changing any existing files. The format is essentially CSL-JSON (already used by all citation management software) with some trivial extensions. We will likely extend it further so that it can contain keys for accessing PID sites (such as Handle.net servers). The file format supports multiple PID standards in parallel, so it doesn't matter which standard wins in the end. This is the basic idea - our prototype code for all of this can be found at
https://github.com/mahogny/citeproc-light and is also integrated in Labstory (text section -> insert -> raw file citation). PID-requesting code is not yet in the library but is on the way.
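To give a feel for the sidecar idea, here is a small Python sketch that writes an x.fastq.fileid file next to the data. The fields id, type, title, author and DOI are ordinary CSL-JSON; the "x-fileid" key, its "hashes" and "pids" sub-keys, and the function names are hypothetical stand-ins invented for this example and are not the actual FILEID extensions (see the repository for those).

```python
import hashlib
import json
import os


def build_fileid_record(data_path, title, authors, doi=None, other_pids=None):
    """Build a CSL-JSON style item describing one data file.

    authors is a list of (family, given) tuples. The "x-fileid"
    block is a hypothetical extension used only for illustration.
    """
    digest = hashlib.sha256()
    with open(data_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    record = {
        "id": os.path.basename(data_path),
        "type": "dataset",
        "title": title,
        "author": [{"family": fam, "given": giv} for fam, giv in authors],
    }
    if doi:
        record["DOI"] = doi
    record["x-fileid"] = {
        "hashes": {"sha256": digest.hexdigest()},
        # Several PID systems can be listed side by side.
        "pids": dict(other_pids or {}),
    }
    return record


def write_fileid(data_path, record):
    """Write the record to <data_path>.fileid and return that path."""
    sidecar_path = data_path + ".fileid"
    with open(sidecar_path, "w", encoding="utf-8") as handle:
        json.dump([record], handle, indent=2)  # CSL-JSON is a list of items
    return sidecar_path
```

Because the sidecar sits beside the data file, a pipeline step only has to copy one extra small file, and the data file itself is never modified.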
We are looking at what others are doing, and I would be happy to hear how you are trying to address this question, or any comments you may have. If you haven't already tried to develop a format, maybe you want to join in our endeavour?
Yours sincerely,
Johan Henriksson