BioHDT working group at BioHackathon 2016

Arto Bendiken

Jun 12, 2016, 1:08:40 AM
to BioHDT mailing list
Everyone,

We're resurrecting the BioHDT effort here at the 9th NBDC/DBCLS
BioHackathon [1], taking place the week of June 12-18 here at Keio
University's Institute for Advanced Biosciences.

The working groups will be formed tomorrow morning, but existing
interested parties here already include at least:

• Alexander Garcia (Universidad Politécnica de Madrid)
• Arto Bendiken (Dydra)
• Evan Bolton (PubChemRDF @ NCBI)
• Michel Dumontier (Bio2RDF @ Stanford University)
• Raoul Bonnal (BioRuby @ Istituto Nazionale di Genetica Molecolare)

Our most important objective for the working group this week is to
improve the quality and performance of the existing tooling for
generating HDT files. This is because, unfortunately, most if not all
interested parties in bioinformatics who have thus far attempted large
conversions into HDT, using either the Java or the C++ implementation,
have run into blockers in the quality and/or performance of the
tooling.

We therefore hope to improve the HDT tooling this week such that
several large gigaquad-range conversions, including PubChemRDF and
Bio2RDF, can move forward. Javier has kindly given me GitHub
collaborator access to the HDT tooling [2], which will facilitate this
work.
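
For those following along, here's a minimal sketch of what such a
conversion looks like through the hdt-java API (the file names and
base URI below are placeholders, and I'm assuming an N-Triples input
dump):

import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class ConvertToHDT {
  public static void main(String[] args) throws Exception {
    // Parse the input dump and build the HDT dictionary and triples
    // in memory; this is the step where the large conversions have
    // been hitting memory and performance blockers.
    HDT hdt = HDTManager.generateHDT(
        "dataset.nt",                // placeholder: input dump
        "http://example.org/base",   // placeholder: base URI
        RDFNotation.NTRIPLES,
        new HDTSpecification(),      // default configuration
        null);                       // no progress listener
    try {
      hdt.saveToHDT("dataset.hdt", null); // write the compact file
    } finally {
      hdt.close();
    }
  }
}

The hdt-cpp tooling exposes the equivalent functionality through the
rdf2hdt command-line tool.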

Other pressing issues for this community with regard to using HDT (as
it is today) concern first and foremost:

1. HDT's lack of native quad support. Right now, if somebody wishes to
publish a bioinformatics dataset as HDT, they'd have to publish one
file (.hdt.gz) per graph, which may entail thousands or tens of
thousands of files for one dataset. (Two files per graph if the index
files are published as well.)

2. HDT index files are currently not interoperable between
implementations [3], which is a problem for anybody who would like to
publish or consume HDT datasets. In practice, one wouldn't publish the
index file at all, and anybody consuming the files would have to
generate it themselves (see the sketch after this list), putting a
damper on the utility of this dataset publishing pipeline.
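
For reference, a rough sketch of that consumer-side indexing step
with the hdt-java API follows (the file name is a placeholder; if I
recall correctly, mapIndexedHDT writes the index next to the .hdt
file on first use so that later loads can reuse it):

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleString;

public class QueryHDT {
  public static void main(String[] args) throws Exception {
    // Map the HDT file and build (or reuse) the additional index.
    // The base .hdt file supports subject-based lookups; the index
    // is what enables the ?PO, ?P?, and ??O access patterns.
    HDT hdt = HDTManager.mapIndexedHDT("dataset.hdt", null);
    try {
      // Empty strings are wildcards: this enumerates every triple.
      IteratorTripleString it = hdt.search("", "", "");
      while (it.hasNext()) {
        System.out.println(it.next());
      }
    } finally {
      hdt.close();
    }
  }
}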

We probably won't have a chance to address these two more fundamental
issues this week, however, and, as noted in my earlier email today,
others elsewhere are in any case already tackling them.

For those who wish to follow along remotely, we'll be using this
BioHDT mailing list [4] and the BioHackathon 2016 wiki [5] for
collaboration.

Kind regards,
Arto

[1] http://2016.biohackathon.org/
[2] https://github.com/rdfhdt
[3] https://github.com/rdfhdt/hdt-cpp/issues/7
[4] https://groups.google.com/forum/#!forum/biohdt
[5] https://github.com/dbcls/bh16/wiki

Ruben Verborgh

Jun 12, 2016, 2:15:09 AM
to bio...@googlegroups.com
Hi all,

Great initiative!

Just wanted to say that I haven't found
the implementation-specific .hdt.index files to be a problem.
They're generated pretty fast anyway.

A bigger problem for me is large HDT generation,
which requires substantial machine specs.
There is a Hadoop effort in development
(showing that distributed HDT generation is possible),
but it is currently not stable enough.

Quad support would indeed be very nice
(see the other mail thread).

Best,

Ruben

Gang Fu

Jun 12, 2016, 2:16:16 PM
to BioHDT
Thank you very much for bringing this up, Arto!

A couple of issues I have encountered while implementing HDT in the PubChemRDF data pipeline:
1) hdt-java/hdt-fuseki can provide a RESTful interface on top of HDT files, which is very nice; however, I found that this implementation does not support the JSON-LD output format, which is now a necessity for serving RDF data. (A possible workaround is sketched after this list.)
2) I am wondering whether it is possible to add a full-text index over literal strings, like a Lucene index, alongside the HDT index.
3) As Evan presented at the BioHackathon meeting, the conversion to HDT files may take 50-100 GB of memory for hundreds of millions of triples, which is not practical in a data pipeline. If we break the data down into multiple HDT files (more HDT conversions), the benefit of the small data file size is lost. Although there is a MapReduce solution, its code and documentation are somewhat incomplete, so it is very hard to work with. It would be nice if more complete code plus documentation could be provided for general users.
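
Regarding 1), for small exports one possible workaround might be to
wrap the HDT file as a Jena model via the hdt-jena bindings and use
Jena's JSON-LD writer. A rough sketch (the file name is a
placeholder, and this assumes a Jena version whose RIOT writer
supports JSON-LD):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class HDTToJSONLD {
  public static void main(String[] args) throws Exception {
    HDT hdt = HDTManager.mapIndexedHDT("subset.hdt", null);
    try {
      // Expose the read-only HDT file as a Jena model.
      Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));
      // Serialize the whole model as JSON-LD.
      RDFDataMgr.write(System.out, model, Lang.JSONLD);
    } finally {
      hdt.close();
    }
  }
}

Of course, serializing a PubChemRDF-scale dataset this way would not
be practical; it would only help for small subsets or per-request
result graphs.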

Best,
Gang

Arto Bendiken

Jun 17, 2016, 1:42:53 AM
to BioHDT mailing list
The BioHackathon here in Tsuruoka is wrapping up today. I've posted
the slide deck from my working group wrap-up talk [1].

The working group participant list can be found in the BH16 wiki [2].

Many of the participants here also plan to attend SWAT4LS 2016 [3] in
Amsterdam this coming December, so we'll have another BioHDT hackathon
then at the latest.

Prior to that, though, there's much work to be done, as I outlined in
the last slide.

[1] https://speakerdeck.com/bendiken/biohdt-at-biohackathon-2016
[2] https://github.com/dbcls/bh16/wiki/BioHDT
[3] http://www.swat4ls.org/