Extending HDT for quads


Arto Bendiken

Jun 12, 2016, 12:16:56 AM
to Javier D. Fernández, Michel Dumontier, Ruben Verborgh, BioHDT
Good morning Javier,

I recall you told me last week at ESWC that extending HDT for quads is
one of the items on your desk right now.

Are you aware of the parallel effort in that direction by Ruben
Verborgh under Michel Dumontier's direction? I imagine so, since you
and Ruben were both at ESWC, but I wanted to check, since these sounded
like independent efforts.

Greets from Tsuruoka,
Arto

Ruben Verborgh

Jun 12, 2016, 12:34:20 AM
to Arto Bendiken, "Javier D. Fernández", Michel Dumontier, BioHDT, Ruben Taelman
Hi Arto,

> Are you aware of the parallel effort in that direction by Ruben
> Verborgh under Michel Dumontier's direction?

Just a small correction:
this was actually previously ongoing work
by a PhD student of mine (Ruben Taelman).
I'm not sure if I introduced you guys at ESWC.

What we're building is basically a workaround,
in which an LDF server uses multiple HDT files
(one for each graph + one for the graphs)
to serve quad data.
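
In code, the setup is roughly the following (a Python sketch; plain dicts stand in for the HDT files, and all names are illustrative):

```python
# Sketch of the multi-HDT workaround: one store per named graph, queried
# together to answer quad patterns. Plain Python sets stand in for the
# per-graph HDT files; all names are illustrative.

def match_quads(stores, s=None, p=None, o=None, g=None):
    """Resolve a quad pattern against per-graph triple stores."""
    targets = [g] if g is not None else list(stores)
    for name in targets:
        for (ts, tp, to) in stores[name]:
            if ((s is None or s == ts) and
                (p is None or p == tp) and
                (o is None or o == to)):
                yield (ts, tp, to, name)

stores = {
    "g1": {("a", "p", "b")},   # stands in for g1.hdt
    "g2": {("a", "p", "c")},   # stands in for g2.hdt
}
print(sorted(match_quads(stores, s="a")))
# -> [('a', 'p', 'b', 'g1'), ('a', 'p', 'c', 'g2')]
```

A fixed graph narrows the lookup to one file; a variable graph fans out over all of them.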

Native support in HDT would be much nicer.

That said, depending on how graphs are used,
another solution might be more interesting from an LDF angle.
If graphs are purely used to identify different datasets,
those datasets could actually just be different TPF interfaces
on which a client performs federated querying.
Having multiple TPF interfaces is much cheaper
than having multiple SPARQL endpoints,
so there is no real necessity to put multiple datasets
in a single endpoint by putting them in different graphs.

If, however, a single dataset has multiple graphs,
we need to look at real quad support
(through the multi-HDT workaround or native in HDT).

@Javier: what's the current status of quads in HDT?
Can we do anything to help (for instance, with hdt-cpp)?

Best,

Ruben

Arto Bendiken

Jun 12, 2016, 1:44:02 AM
to Ruben Verborgh, BioHDT, Ruben Taelman
Hi Ruben,

On Sun, Jun 12, 2016 at 1:34 PM, Ruben Verborgh <ruben.v...@ugent.be> wrote:
>> Are you aware of the parallel effort in that direction by Ruben
>> Verborgh under Michel Dumontier's direction?
>
> Just a small correction:
> this was actually previously ongoing work
> by a PhD student of mine (Ruben Taelman).
> I'm not sure if I introduced you guys at ESWC.

Thanks for the correction. Yes, I did indeed briefly speak to your
namesake at ESWC.

> What we're building is basically a workaround,
> in which an LDF server uses multiple HDT files
> (one for each graph + one for the graphs)
> to serve quad data.
>
> Native support in HDT would be much nicer.
>
> That said, depending on how graphs are used,
> another solution might be more interesting from an LDF angle.
> If graphs are purely used to identify different datasets,
> those datasets could actually just be different TPF interfaces
> on which a client performs federated querying.
> Having multiple TPF interfaces is much cheaper
> than having multiple SPARQL endpoints,
> so there is no real necessity to put multiple datasets
> in a single endpoint by putting them in different graphs.

I see. A valuable workaround, to be sure, but at least at the moment
the publishing and distribution flow in the bioinformatics community
is largely based around putting up data dumps on FTP servers.

> If, however, a single dataset has multiple graphs,
> we need to look at real quad support
> (through the multi-HDT workaround or native in HDT).

Yes, we'd definitely like to see native support for quads to make HDT
maximally useful in all situations. Hopefully Javier has some thoughts
on how this could be done in a straightforward and
backwards-compatible manner.

Cheers,
Arto

Ruben Verborgh

Jun 12, 2016, 1:46:27 AM
to Arto Bendiken, BioHDT, Ruben Taelman
Hi Arto,

>> Having multiple TPF interfaces is much cheaper
>> than having multiple SPARQL endpoints,
>> so there is no real necessity to put multiple datasets
>> in a single endpoint by putting them in different graphs.
>
> I see. A valuable workaround, to be sure, but at least at the moment
> the publishing and distribution flow in the bioinformatics community
> is largely based around putting up data dumps on FTP servers.

One doesn't contradict the other.

The main question is just:
are graphs being used only to differentiate datasets?
If so, I don't think the graphs are really meaningful
in those cases, and we can drop the fourth component.

Ruben

Arto Bendiken

Jun 12, 2016, 1:50:41 AM
to Ruben Verborgh, BioHDT, Ruben Taelman
Hi Ruben,

On Sun, Jun 12, 2016 at 2:46 PM, Ruben Verborgh <ruben.v...@ugent.be> wrote:
> Hi Arto,
>
>>> Having multiple TPF interfaces is much cheaper
>>> than having multiple SPARQL endpoints,
>>> so there is no real necessity to put multiple datasets
>>> in a single endpoint by putting them in different graphs.
>>
>> I see. A valuable workaround, to be sure, but at least at the moment
>> the publishing and distribution flow in the bioinformatics community
>> is largely based around putting up data dumps on FTP servers.
>
> One doesn't contradict the other.

True.

> The main question is just:
> are graphs being used only to differentiate datasets?
> If so, I don't think the graphs are really meaningful
> in those cases, and we can drop the fourth component.

I don't think telling the community to drop their graphs is going to fly ;-)

In any case, graphs are used as a grouping mechanism within datasets,
i.e., a single logical dataset could contain tens of thousands (or
more) of graphs.

Cheers,
Arto

Ruben Verborgh

Jun 12, 2016, 2:09:36 AM
to Arto Bendiken, BioHDT, Ruben Taelman
> I don't think telling the community to drop their graphs is going to fly ;-)

_Iff_ they are used _only_ to indicate the dataset,
i.e., the graph is the same for the entire dataset,
then I see no problem dropping it.

As an example for this case:
http://mo-ld.org/ contains 6 datasets.
There is 1 SPARQL endpoint for the 6 datasets,
and each dataset has its own graph.

If I were to publish this data as TPF,
I would simply create 6 interfaces
(1 per dataset) and execute a SPARQL query
over the federation of the interfaces,
where I can choose for each query
which of the 6 interfaces is included.

I would prefer this over publishing 6 datasets
in 1 QPF (Quad Pattern Fragments) interface.

> In any case, graphs are used as a grouping mechanism within datasets,
> i.e., a single logical dataset could contain tens of thousands (or
> more) of graphs.

That's a different case indeed;
for that, it's either multi-HDT or a single quad-HDT.
And definitely QPF.

Ruben

Michel Dumontier

Jun 12, 2016, 2:46:06 AM
to bio...@googlegroups.com, Arto Bendiken, Ruben Taelman
Part of the issue here is that we normally add provenance metadata to
the graph name. So what do you suggest we do instead?

m.
Michel Dumontier
Associate Professor of Medicine (Biomedical Informatics), Stanford University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com

james anderson

Jun 12, 2016, 7:04:40 AM
to bio...@googlegroups.com
good morning;

On 2016-06-12, at 08:09, Ruben Verborgh <ruben.v...@ugent.be> wrote:

>> I don't think telling the community to drop their graphs is going to fly ;-)
>
> _Iff_ they are used _only_ to indicate the dataset,
> i.e., the graph is the same for the entire dataset,
> then I see no problem dropping it.
>
> As an example for this case:
> http://mo-ld.org/ contains 6 datasets.
> There is 1 SPARQL endpoint for the 6 datasets,
> and each dataset has its own graph.

just to be careful with the terminology, that would depend on how the request specifies the sparql dataset.
the combinatorics of six available graphs between FROM and FROM NAMED yield more than six datasets.


> If I were to publish this data as TPF,
> I would simply create 6 interfaces
> (1 per dataset) and execute a SPARQL query
> over the federation of the interfaces,
> where I can choose for each query
> which of the 6 interfaces is included.
>
> I would prefer this over publishing 6 datasets
> in 1 QPF (Quad Pattern Fragments) interface.

for the mo-ld case, this could make sense if the use cases do turn out to be that one queries the graphs independently.
then, so long as no interdependence leads to transactional requirements, there could be advantages with respect to marshalling latency. 

best regards, from berlin,
---
james anderson | ja...@dydra.com | http://dydra.com

james anderson

Jun 12, 2016, 7:06:11 AM
to bio...@googlegroups.com
On 2016-06-12, at 08:45, Michel Dumontier <michel.d...@gmail.com> wrote:

> Part of the issue here is that we normally add provenance metadata to
> the graph name. So what do you suggest we do instead?

you would need to track provenance “out-of-band”, in its own repository with reference to the distinct repositories rather than to the graphs of the single repository.

if your current approach is to add the provenance information to its own graph, then, so long as you do not need to merge that metadata into the default graph, federating against a distinct provenance repository would provide logically equivalent access.


[…]


On Sat, Jun 11, 2016 at 11:09 PM, Ruben Verborgh
<ruben.v...@ugent.be> wrote:
>> I don't think telling the community to drop their graphs is going to fly ;-)
>
> _Iff_ they are used _only_ to indicate the dataset,
> i.e., the graph is the same for the entire dataset,
> then I see no problem dropping it.
> […]

Javier D. Fernández

Jun 12, 2016, 6:28:52 PM
to BioHDT, ruben.v...@ugent.be, ruben....@ugent.be
Hi, 

As for the status of HDT-Quads, I started the project (in C++) some months ago, but I discontinued it. It is now more relevant for my project focused on RDF versioning, as I can use such graphs to denote versions. We have already extended a similar triple index (different from HDT) for that purpose in a recent work, "Self-Indexing RDF Archives": http://dataweb.infor.uva.es/wp-content/uploads/2016/01/dcc16.pdf. There, we used a bitsequence per version (of length n, the number of triples), where a 1-bit marks that the triple is present in that version. My initial approach for HDT-Quads was a bit fancier, but one could start with this approach.
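
In code, the bitsequence idea looks roughly like this (a Python sketch; plain lists stand in for compressed bitsequences, and the data is illustrative):

```python
# Sketch of the bitsequence-per-version idea from "Self-Indexing RDF
# Archives": one global list of n triples, plus a bitsequence of length n
# per version, where a 1-bit marks that the triple is present in that
# version. Plain lists stand in for compressed bitsequences.

triples = [
    ("a", "p", "b"),
    ("a", "p", "c"),
    ("d", "q", "e"),
]

# One bitsequence per version, each of length n = len(triples).
versions = [
    [1, 1, 0],  # version 0: first two triples
    [1, 1, 1],  # version 1: third triple added
    [1, 0, 1],  # version 2: second triple deleted
]

def triples_in(version):
    """All triples present in a given version."""
    bits = versions[version]
    return [t for t, bit in zip(triples, bits) if bit]

print(triples_in(2))  # -> [('a', 'p', 'b'), ('d', 'q', 'e')]
```

The same layout works for named graphs instead of versions: one bitsequence per graph over the global triple list.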

I should have time in the following weeks to continue the project. I can start by creating a branch of the library and pushing the changes I had for quads (e.g., extending TripleIDs with context, adding a NquadParser, adding a dictionary of graphs, etc.).
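
As a rough illustration of that extension (a Python sketch; the single term dictionary is a simplification of HDT's sectioned dictionaries, and all names are illustrative):

```python
# Sketch of the quad extension outlined above: triple IDs gain a fourth
# (graph/context) component, backed by a separate dictionary of graph
# IRIs. This mirrors HDT's ID-based encoding in plain Python; the single
# term dictionary is a simplification, and all names are illustrative.

from typing import NamedTuple

class Dictionary:
    """Maps terms to integer IDs and back, as HDT dictionaries do."""
    def __init__(self):
        self._ids, self._terms = {}, []

    def add(self, term):
        if term not in self._ids:
            self._ids[term] = len(self._terms) + 1  # IDs start at 1
            self._terms.append(term)
        return self._ids[term]

    def term(self, id_):
        return self._terms[id_ - 1]

class QuadID(NamedTuple):
    s: int
    p: int
    o: int
    g: int  # the new context component

terms = Dictionary()   # subjects/predicates/objects (simplified)
graphs = Dictionary()  # the separate dictionary of graphs

def encode(s, p, o, g):
    return QuadID(terms.add(s), terms.add(p), terms.add(o), graphs.add(g))

q = encode("ex:a", "ex:p", "ex:b", "ex:g1")
print(q, graphs.term(q.g))  # the graph ID resolves back to "ex:g1"
```

Keeping the graph IDs in their own dictionary is what would let old readers ignore the fourth component.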

Cheers

Arto Bendiken

Jun 13, 2016, 2:25:38 AM
to BioHDT mailing list
Good morning Javier,

On Mon, Jun 13, 2016 at 7:28 AM, Javier D. Fernández
<jfer...@gmail.com> wrote:
> I should have time the following weeks and continue the project, I can start
> by creating a branch of the library and push the changes I had for the quads
> (e.g. extending TripleIDs with context, adding a NquadParser, add a
> dictionary of graphs, etc.).

This sounds promising, looking forward.

Do you envision that the transition to quad HDT (HDT4?) will create
any compatibility issues with opening/using existing triple-HDT files
going forward?

Kind regards,
Arto

jfernand

Jun 13, 2016, 4:51:58 AM
to bio...@googlegroups.com, Arto Bendiken
Hi,

> Do you envision that the transition to quad HDT (HDT4?) will create
> any compatibility issues with opening/using existing triple-HDT files
> going forward?

No, I don't think so; you could still use existing HDT files. Hopefully
you could also load quad files with the old library, disregarding the
graph component, but this is not guaranteed.

Cheers,
Javier

Fu, Gang (NIH/NLM/NCBI) [E]

Jun 13, 2016, 2:49:54 PM
to bio...@googlegroups.com, Arto Bendiken, "Javier D. Fernández", Michel Dumontier, Ruben Taelman
Hi Ruben,

> What we're building is basically a workaround, in which an LDF server uses multiple HDT files (one for each graph + one for the graphs) to serve quad data.

Can you show me how you do "one for the graphs"? I guess you load an RDF triple file into the default graph, right? Are there any particular RDF statements we need to put into that file?

Without an index for quads, this query

SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }

will take forever...

How can your workaround solution answer such a query?

Best,
Gang

Ruben Verborgh

Jun 13, 2016, 3:21:08 PM
to bio...@googlegroups.com, Arto Bendiken, "Javier D. Fernández", Michel Dumontier, Ruben Taelman
Hi Gang,

> Can you show me how you do "one for the graphs"?

I was thinking of a dataset like:
<g1> a :Graph.
<g2> a :Graph.
<g3> a :Graph.
possibly also containing triples
that connect a graph IRI to an HDT file for that graph.

Those triples would not really be part of the dataset,
but rather provide an index for the graphs.

> Without an index for quads, this query
> SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }
> will take forever...

Queries such as that one are the reason
this extra HDT file would exist.
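
Roughly, in code (a Python sketch; :Graph and the index triples follow the example above and are illustrative):

```python
# Sketch of how the extra "graphs" HDT file answers SELECT DISTINCT ?g:
# instead of scanning every quad, the query reads the small index dataset
# directly. A plain set stands in for the extra HDT file; the :Graph
# vocabulary is illustrative.

graph_index = {
    ("<g1>", "a", ":Graph"),
    ("<g2>", "a", ":Graph"),
    ("<g3>", "a", ":Graph"),
}

def distinct_graphs(index):
    """SELECT DISTINCT ?g, answered from the index alone."""
    return sorted(s for (s, p, o) in index if p == "a" and o == ":Graph")

print(distinct_graphs(graph_index))  # -> ['<g1>', '<g2>', '<g3>']
```

The cost of listing the graphs is then proportional to the number of graphs, not the number of quads.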

Best,

Ruben