Everyone,
We're resurrecting the BioHDT effort at the 9th NBDC/DBCLS
BioHackathon [1], taking place the week of June 12-18 here at Keio
University's Institute for Advanced Biosciences.
The working groups will be formed tomorrow morning, but existing
interested parties here already include at least:
• Alexander Garcia (Universidad Politécnica de Madrid)
• Arto Bendiken (Dydra)
• Evan Bolton (PubChemRDF @ NCBI)
• Michel Dumontier (Bio2RDF @ Stanford University)
• Raoul Bonnal (BioRuby @ Istituto Nazionale di Genetica Molecolare)
Our most important objective for the working group this week is
improving the quality and performance of the existing tooling for
generating HDT files. This is because, unfortunately, most or all
interested parties in bioinformatics who have attempted large
conversions into HDT so far, using either the Java or the C++
implementation, have run into blockers in the quality and/or
performance of the tooling.
We therefore hope to improve tooling for HDT this week such that
several large gigaquad-range conversions can move forward, including
PubChemRDF and Bio2RDF. Javier has kindly given me GitHub collaborator
access to the HDT tooling [2], which will facilitate this work.
Other pressing issues of importance for this community with regard to
using HDT (as it is today) concern first and foremost:
1. HDT's lack of native quad support. Right now, if somebody wishes to
publish a bioinformatics dataset as HDT, they'd have to publish one
file (.hdt.gz) per graph, which may entail thousands or tens of
thousands of files for one dataset. (Two files per graph if also
publishing the index files.)
2. HDT index files right now are not interoperable between
implementations [3], which is a problem for anybody who would like to
publish or consume HDT datasets--basically, right now, one wouldn't
publish the index file and anybody consuming the files would have to
generate it themselves, putting a damper on the utility of this
dataset publishing pipeline.
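
To make the per-graph workaround in item 1 concrete, here is a minimal
Python sketch that partitions N-Quads lines into one group of triples
per named graph, which is the split a publisher would need before
converting each graph to its own HDT file. The names
partition_by_graph, TERM and QUAD are hypothetical, the regular
expression is a simplification of the N-Quads grammar, and the sketch
holds everything in memory, so it illustrates the idea rather than
anything suitable for gigaquad-range inputs.

```python
import re
from collections import defaultdict

# Simplified N-Quads term: IRI, blank node, or literal with an optional
# language tag or datatype. An approximation of the grammar, not a full parser.
TERM = r'(?:<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(
    rf'^\s*({TERM})\s+({TERM})\s+({TERM})(?:\s+(<[^>]*>|_:\S+))?\s*\.\s*$'
)


def partition_by_graph(lines):
    """Group N-Quads statements by graph label.

    Returns a dict mapping each graph label (None for the default graph)
    to a list of N-Triples lines belonging to that graph.
    """
    graphs = defaultdict(list)
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # skip blank lines and comments
        match = QUAD.match(line)
        if match is None:
            raise ValueError(f"not a recognizable N-Quads line: {line!r}")
        s, p, o, g = match.groups()
        graphs[g].append(f"{s} {p} {o} .")
    return dict(graphs)


if __name__ == "__main__":
    sample = [
        '<http://ex/s1> <http://ex/p> "a b c" <http://ex/g1> .',
        '<http://ex/s2> <http://ex/p> <http://ex/o> <http://ex/g2> .',
        '<http://ex/s3> <http://ex/p> <http://ex/o> .',
    ]
    for graph, triples in partition_by_graph(sample).items():
        print(graph, len(triples))
```

Each resulting group would then be written to its own N-Triples file
and converted separately, which is where the thousands of output files
come from.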
We probably won't have a chance to address these two more fundamental
issues this week, however, and, as noted in my earlier email today,
others elsewhere are in any case already tackling them.
For those who wish to follow along remotely, we'll be using this
BioHDT mailing list [4] and the BioHackathon 2016 wiki [5] for
collaboration.
Kind regards,
Arto
[1] http://2016.biohackathon.org/
[2] https://github.com/rdfhdt
[3] https://github.com/rdfhdt/hdt-cpp/issues/7
[4] https://groups.google.com/forum/#!forum/biohdt
[5] https://github.com/dbcls/bh16/wiki