Welcome to the BioHDT mailing list


Arto Bendiken

Dec 14, 2015, 8:36:31 AM
to BioHDT mailing list
Hi everyone,

We're still missing a couple of people here, but most of those I
invited are now signed up and the rest can read the archives, so let's
get started. Please do go ahead and invite any interested colleagues
and anyone we might have overlooked; the member list [1] is visible to
all current members.

The impetus for this group is the considerable interest at the SWAT4LS
hackathon in Cambridge [2] last week in the use of the compact HDT
binary file format for RDF data [3] to publish bioinformatics
datasets. The question posed several times was "why didn't we know
about this?"

Several major bio datasets are now in the process of being converted
to HDT. Alexander had already converted the Reactome dataset [4,5] to
HDT previously, and Michel has been working on converting Bio2RDF [6].

During the hackathon, Evan & Gang began a conversion of PubChemRDF [7]
and Atsuko & Yasunori completed converting Allie [8]. Núria also
expressed an interest in converting DisGeNET datasets.

Egon and Atsuko & Yasunori independently succeeded in executing SPARQL
queries directly on local HDT files using the Jena adapter from the
HDT/Java [9] project. Atsuko & Yasunori & Arto also published part of
the Allie ontology on Dydra using Dydra's native HDT storage backend
[10].

Egon began work on a Bioclipse plugin for HDT, which he has since
completed. I've invited Egon to post a summary of that work here, as
I'm sure many of you are interested in checking that out.

Evan presented a summary of the day's findings at the wrap-up session
for the hackathon. If the slide deck could be considered public,
perhaps Evan might be so kind as to post it to the list?

The purpose of this group is to bring all these conversations about
HDT for bioinformatics to one channel to the benefit of everyone.
(Currently, I have a dozen follow-up email threads going on, which is
unwieldy and also inevitably excludes some interested parties.)

I've also invited the original HDT spec & tooling authors to join the
group, and Mario Arias [11] already has. So, we have a lot of
expertise in this room; let's make use of it to figure out and realize
the benefits that HDT can bring to the bioinformatics community!

[1] https://groups.google.com/forum/#!members/biohdt
[2] http://www.swat4ls.org/workshops/cambridge2015/programme/hackathon/
[3] http://www.rdfhdt.org
[4] https://www.ebi.ac.uk/rdf/services/reactome/
[5] https://github.com/alexgarciac/gittemp
[6] http://bio2rdf.org/
[7] https://pubchem.ncbi.nlm.nih.gov/rdf/
[8] http://allie.dbcls.jp/
[9] https://github.com/rdfhdt/hdt-java
[10] http://dydra.com/bendiken/allie-ontology
[11] https://github.com/MarioAriasGa

--
Arto Bendiken | @bendiken | @dydradata
SWAT4LS 2015-12-09.pdf

Javier D. Fernández

Dec 14, 2015, 12:10:29 PM
to BioHDT
Hi all, 

I'm Javier D. Fernández, one of the co-authors of HDT. It's really a pleasure to see that this project is useful to some extent, especially if it can help the bioinformatics community.

Please let us know how we can help. On the implementation side, Mario Arias is the main developer, but of course any of us can try to answer questions about it.

From our side, it would be great to find novel requirements. For instance, I'm currently working on compressed archiving, i.e. efficiently storing and querying different versions of a dataset. I know that versioning has been raised in the health care and life sciences domain [1], so you may have some nice use cases that you would like to share and discuss.


All the best,
Javier D. Fernández
WU Vienna
Institute for Information Business

Bas Stringer

Dec 14, 2015, 4:24:36 PM
to bio...@googlegroups.com
Hi Javier,

The compression ratios I've seen so far are very impressive, and work like LDaaS [1] clearly demonstrates how valuable a contribution HDT is to making Linked (Open) Data accessible. Regarding novel requirements/features:

1) I'd personally like to see an extension of the implementation to support Quads. As far as I understand, this is currently achievable by sticking every graph in a file of its own, but I reckon there are advantages to having multiple graphs represented in a single file. My intuition is that this will negatively impact the rather amazing compression rates, but depending on the extent of that impact, it may be a price worth paying. Is it true this is currently being developed? If so, how far along is it?
2) A Python wrapper for the currently provided libraries/functions would be very useful for silly biologists like myself, who aren't particularly trained in the programming department.
3) I'm not entirely certain how feasible this is, but the ability to intelligently merge or split HDT files, without having to re-compress them entirely from scratch, would be very helpful in keeping (very) large datasets with rapidly changing content up to date. Think for example of Wikidata, which constantly has entries added (removed?) and edited. Even if these operations result in (slightly) suboptimal compression rates, not having to re-compress from scratch to immediately incorporate changes to the data seems like it would have great value.
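
To make the one-graph-per-file workaround from point 1 concrete, here is a minimal sketch, assuming the source data is available as N-Quads. It only groups statements by graph and emits plain N-Triples lines; each group could then be written to its own file and converted to HDT with whatever tooling you prefer. The function name `split_by_graph`, the `DEFAULT_GRAPH` label, and the simplified term pattern are illustrative assumptions, not part of any HDT library.

```python
import re
from collections import defaultdict

# Simplified N-Quads term pattern (assumption: well-formed input with simple
# blank node labels): an IRI, a blank node, or a literal with an optional
# datatype or language tag.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:[A-Za-z0-9_]+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)

# Hypothetical label for statements that carry no graph term.
DEFAULT_GRAPH = "default"


def split_by_graph(nquads_lines):
    """Group N-Quads statements by graph, returning N-Triples lines per graph."""
    groups = defaultdict(list)
    for line in nquads_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        terms = TERM.findall(line)
        if len(terms) not in (3, 4):
            raise ValueError(f"cannot parse statement: {line!r}")
        # A fourth term, when present, is the graph label.
        graph = terms[3] if len(terms) == 4 else DEFAULT_GRAPH
        groups[graph].append(" ".join(terms[:3]) + " .")
    return dict(groups)
```

The obvious cost of this approach is the one already noted above: every graph becomes a separate file with its own dictionary, so terms shared between graphs are stored more than once.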

Kind regards,
Bas Stringer

[1] 



Bas Stringer

Dec 14, 2015, 4:25:15 PM
to bio...@googlegroups.com