Persistent IDs?

Johan Henriksson

Feb 18, 2014, 11:30:15 AM

to openscienc...@googlegroups.com

Hello everyone,

I am the lead developer of Labstory (www.labstory.se), a documentation/database system. I just heard of the OSF and will certainly be following your progress! We want to achieve the same thing but have taken a different path: we are trying to find a commercial model to fund our development, but, coming from open source, we give most of it away for free (all documents have globally unique IDs and can easily be shared, the server supports anonymous login, etc.). But I am not going to turn this into an ad; the particular topic I would like to address is persistent IDs (PIDs) and documentation.

The problem we are trying to solve is how to document the use of big datasets, such as FASTQ files. Commonly only a file path to the data is stored, but this is not enough, because files tend to move around a lot. DOIs address this, as do several other initiatives with different benefits (a big one being that many of them are free to use). However, even a PID is not enough on its own, although it is a lot better than a path. In addition we need to store fingerprints of the files, which are then signed cryptographically together with the rest of the document.
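As a rough illustration of the fingerprinting step described above, here is a minimal Python sketch. It is not Labstory's code: the function names and the "algorithm:hexdigest" string layout are assumptions made for this example, and the cryptographic signing step is left out.

```python
import hashlib


def file_fingerprint(path, algorithm="sha256", chunk_size=1 << 20):
    """Return 'algorithm:hexdigest' for the raw bytes of a file.

    The 'algorithm:hexdigest' layout is a hypothetical convention for this
    sketch; it keeps the algorithm name next to the digest.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so very large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"{algorithm}:{digest.hexdigest()}"


def verify_fingerprint(path, recorded):
    """Recompute the fingerprint with the recorded algorithm and compare."""
    algorithm, _, _ = recorded.partition(":")
    return file_fingerprint(path, algorithm) == recorded
```

Because the fingerprint depends only on the bytes, it still verifies after the file has been moved or renamed, which is exactly the case where a stored file path fails.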

We are pushing for two things. The first is increased use of persistent DOIs, assigned already when the data is generated rather than 4 years later with the publication. The second is a standardization of how files are hashed, so that we can store the hash and verify it later. In addition we are working on file-format-independent hash methods, which will allow us to phase out awful formats such as FASTQ in favour of better compressed ones. We are also looking into file-format-independent hashes for microscopy images.
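One way to picture a file-format-independent hash is to hash the parsed records instead of the bytes on disk. The sketch below is only an assumption about how such a scheme could look, not the method in our prototype: it assumes simple four-line FASTQ records (no wrapped sequences), and the function names and the "fastq-records-" prefix are made up for the example.

```python
import gzip
import hashlib


def open_maybe_gzip(path):
    """Open plain or gzip-compressed text, detected by the gzip magic bytes."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    # newline=None enables universal newlines, so CRLF and LF read the same.
    return opener(path, "rt", encoding="ascii", newline=None)


def fastq_content_hash(path, algorithm="sha256"):
    """Hash the (read name, sequence, quality) records, not the file bytes."""
    digest = hashlib.new(algorithm)
    with open_maybe_gzip(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip stray blank lines
            sequence = handle.readline()
            handle.readline()  # '+' line carries no extra information
            quality = handle.readline()
            for field in (header[1:], sequence, quality):
                data = field.rstrip("\n").encode("ascii")
                # A length prefix keeps field boundaries unambiguous.
                digest.update(len(data).to_bytes(8, "big"))
                digest.update(data)
    return f"fastq-records-{algorithm}:{digest.hexdigest()}"
```

With this approach a plain file, its gzip-compressed copy, and a copy with Windows line endings all give the same value, while any change to a read name, base, or quality score changes it.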

The format we are trying to standardize on is currently called FILEID. For a file x.fastq, you create x.fastq.fileid, containing DOIs, PIDs and hashes. It also contains all author information. The file is ~4 kB, is easy to integrate into pipelines, and does not require changing any existing files. The format is essentially JSON-CSL (already used by all citation management software) with some trivial extensions. We will likely extend it further so that it can hold keys for accessing PID sites (such as Handle.net servers). The format supports multiple PID standards in parallel, so it doesn't matter which standard wins in the end. That is the basic idea; our prototype code for all of this can be found at
https://github.com/mahogny/citeproc-light
and is also integrated in Labstory (text section -> insert -> raw file citation). PID-requesting code is not yet in the library but is on the way.
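To make the sidecar idea concrete, here is a small Python sketch that writes an x.fastq.fileid file next to the data. It does not reproduce the actual FILEID schema from the repository above: the "x-pids" and "x-hashes" keys, the write_fileid name, and the example DOI are hypothetical placeholders, while "id", "type", "title" and "author" follow the usual CSL item layout.

```python
import hashlib
import json
from pathlib import Path


def write_fileid(data_path, title, authors, pids):
    """Write '<data file>.fileid' beside the data file and return its path.

    authors: list of (family, given) tuples.
    pids: list of {"scheme": ..., "value": ...} dicts, so several PID
          standards can sit side by side (hypothetical layout).
    """
    data_path = Path(data_path)
    digest = hashlib.sha256()
    with open(data_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    record = {
        "id": data_path.name,
        "type": "dataset",
        "title": title,
        "author": [{"family": family, "given": given} for family, given in authors],
        # Hypothetical extension keys, not taken from the real format.
        "x-pids": pids,
        "x-hashes": [{"algorithm": "sha256", "value": digest.hexdigest()}],
    }
    sidecar = data_path.with_name(data_path.name + ".fileid")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar
```

A call could look like `write_fileid("x.fastq", "Example run", [("Henriksson", "Johan")], [{"scheme": "doi", "value": "10.1234/placeholder"}])`, where the DOI is a made-up placeholder. The data file itself is never modified, which is the point of using a sidecar.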

We are looking at what others are doing, and I would be happy to hear how you are addressing this question, or any comments you have. If you haven't already started developing a format, maybe you would like to join our endeavour?

Yours sincerely,
Johan Henriksson

Philip Durbin

Feb 18, 2014, 9:42:36 PM

to openscienc...@googlegroups.com

One option for fingerprinting files so that they can be signed together cryptographically is UNF (Universal Numerical Fingerprint): http://thedata.org/publications/fingerprint-method-verification-scientific-data

--
You received this message because you are subscribed to the Google Groups "Open Science Framework" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openscienceframe...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Johan Henriksson

Feb 19, 2014, 8:09:34 AM

to openscienc...@googlegroups.com

Hello!

Indeed, UNF is an option we are looking at. However, UNF leaves many practical problems unsolved, as it works on raw data rather than on the files we encounter in real life. It thus puts a lot of burden on the data generator instead of being a complete black-box solution. In particular, even if we adopt it, we still need to develop schemes for presenting e.g. image data (from 130+ commonly used formats) to the algorithm as 1D number vectors (because images are, at their simplest, n-dimensional vectors). Other complaints concern the poor performance and the hardcoded choice of hashing algorithm, since we expect any algorithm to have a lifespan of only 10 years.
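The two points about images and the choice of algorithm can be sketched together in Python. This is an assumption-laden illustration, not UNF and not our prototype: it treats an image as a regular nested list of unsigned 16-bit pixel values, flattens it in row-major order, hashes the dimensions along with the values, and lets the caller pick the algorithm. The function names and the byte layout are invented for the example.

```python
import hashlib
import struct


def shape_of(nested):
    """Dimensions of a regular nested list, e.g. [[1, 2], [3, 4]] gives (2, 2)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(shape)


def flatten(nested):
    """Yield the values of a nested list in row-major order."""
    if isinstance(nested, list):
        for item in nested:
            yield from flatten(item)
    else:
        yield nested


def pixel_hash(pixels, algorithm="sha256"):
    """Hash pixel values independently of the image file format.

    Assumes unsigned 16-bit pixels. The shape is hashed first so that a
    2x3 and a 3x2 image with the same flat values do not collide.
    """
    digest = hashlib.new(algorithm)
    shape = shape_of(pixels)
    digest.update(struct.pack(">I", len(shape)))
    for extent in shape:
        digest.update(struct.pack(">Q", extent))
    for value in flatten(pixels):
        # Fixed big-endian encoding, whatever the byte order of the source file.
        digest.update(struct.pack(">H", value))
    return f"{algorithm}:{digest.hexdigest()}"
```

Storing the algorithm name with the digest means an old record can still be verified after the default has been replaced, which matters if algorithms are expected to age out.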

So UNF is nice to look at and learn from, but I fear the design doesn't easily lend itself to widespread adoption. On the good side is their work on normalizing floating-point values. Floating point is a truly complicated chapter.
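To show why normalization matters, here is a simplified Python sketch in the spirit of that idea. It is not the UNF algorithm itself: the choice of 7 significant digits as the default, the text encoding, the handling of special values, and the function names are all assumptions for this example.

```python
import hashlib
import math


def canonical_float(value, significant_digits=7):
    """Render a float as text after rounding to a fixed number of digits."""
    if math.isnan(value):
        return "nan"
    if math.isinf(value):
        return "+inf" if value > 0 else "-inf"
    if value == 0:
        return "0"  # design choice in this sketch: -0.0 and 0.0 are treated alike
    # Exponent notation with (significant_digits - 1) digits after the point.
    return format(value, f".{significant_digits - 1}e")


def float_vector_hash(values, algorithm="sha256"):
    """Hash a sequence of floats via their canonical text form."""
    digest = hashlib.new(algorithm)
    for value in values:
        digest.update(canonical_float(value).encode("ascii") + b"\n")
    return f"{algorithm}:{digest.hexdigest()}"
```

For instance, 0.1 + 0.2 and 0.3 differ in their binary representation but both render as 3.000000e-01 here, so they hash identically. Values that sit right on a rounding boundary can still land on different sides, which is one reason floating point remains hard.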

Yours sincerely,
Johan Henriksson

--
-----------------------------------------------------------
Johan Henriksson, PhD
Karolinska Institutet
Ecobima AB - Custom solutions for life sciences
http://www.ecobima.com http://mahogny.areta.org http://www.endrov.net

Philip Durbin

Feb 19, 2014, 8:58:40 AM

to openscienc...@googlegroups.com, dataverse...@googlegroups.com

Good to know you're looking at UNF. It's used in Dataverse and I'll cc the Dataverse mailing list. The new file formats you're working on sound interesting.

Phil
