Dear all,
Quick status update. I just spoke briefly to Laurens [Rietveld]. It turns out he is only using HDT, not developing it, but he did give me the names of two people who are: apparently Javier Fernandez and Axel Polleres are currently working on extending HDT to quads.
We also discussed performance and some of the issues they encountered [with LDaaS] [1]:
- Data retrieval through the Triple Pattern Fragments (TPF) API, using HDT, is relatively fast, simple and cheap. However, it also means that more complex queries (e.g. SPARQL queries requiring joins, filters, aggregation... across such fragments) become very slow: each entire fragment has to be sent to the client before it can be joined, and none of the indexes a dedicated SPARQL endpoint relies on are available (see the join sketch below this list). Among other things, this causes timeouts on 5 of the FedBench queries in figure 3 of [1], which are hard to get around without moving away from TPF.
- Laurens mentioned a student who worked on refining "the greedy implementation where the set of sent HTTP requests is a naive Cartesian product between the set of fragments and the datasets" -- the results vary from "no improvement" to "only half as slow as a SPARQL endpoint", depending on the complexity of the query.
- The native implementation of TPF uses memory-mapping to speed up file I/O, but in practice the gain is relatively small (at least for small files). Because keeping too many files mapped this way causes instability, LDaaS only uses it for very large files (see the mmap sketch below).
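
To make the first point concrete, here is a minimal Python sketch of what a client-side join over TPF looks like. This is not the LDaaS or TPF client code: the endpoint URL is made up, the query parameter names are an assumption (a real client discovers them from the hypermedia controls in each fragment), and paging is left out.

    import requests

    TPF_ENDPOINT = "http://example.org/fragments"  # hypothetical endpoint

    def parse_ntriples(text):
        # Extremely naive N-Triples parsing, for illustration only.
        triples = []
        for line in text.splitlines():
            line = line.strip().rstrip(" .")
            if line:
                s, p, o = line.split(" ", 2)
                triples.append((s, p, o))
        return triples

    def fetch_fragment(subject=None, predicate=None, obj=None):
        # One HTTP request per triple pattern; paging over large fragments omitted.
        params = {"subject": subject, "predicate": predicate, "object": obj}
        resp = requests.get(TPF_ENDPOINT,
                            params={k: v for k, v in params.items() if v},
                            headers={"Accept": "application/n-triples"})
        resp.raise_for_status()
        return parse_ntriples(resp.text)

    # Two-pattern query: ?person foaf:knows ?friend . ?friend foaf:name ?name
    # Both fragments are downloaded in full and joined on the client, without
    # any of the server-side indexes a SPARQL endpoint would use.
    knows = fetch_fragment(predicate="<http://xmlns.com/foaf/0.1/knows>")
    names = fetch_fragment(predicate="<http://xmlns.com/foaf/0.1/name>")
    results = [(person, name)
               for (person, _, friend) in knows
               for (friend2, _, name) in names
               if friend2 == friend]

The actual TPF client is smarter than this (it uses the fragments' count metadata to decide which pattern to resolve first), but the join itself still happens on the client, which is where the slowdown comes from.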
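
And for the memory-mapping point, a generic Python sketch of the trade-off (the native implementation is not Python; the path, offset and length here are just placeholders): with mmap, only the pages you actually touch are read in, but every mapped file stays addressable for as long as the mapping is open, which is where the "too many mapped files" problem comes from.

    import mmap

    def read_slice_mmap(path, offset, length):
        # Random access via a memory map: only the touched pages are faulted in.
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return mm[offset:offset + length]

    def read_slice_seek(path, offset, length):
        # Same access pattern with an explicit seek + read, no mapping kept open.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

For small files the two are essentially indistinguishable, which matches what Laurens said.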
Hope this helps.
Cheers,
Bas
@ Gang: Have you looked at HDT-MR, which I linked earlier? [2] Apparently, using MapReduce decreases the amount of memory needed for the compression step. I'm not sure whether it reduces it enough to be viable for PubChem's data, but it might be worth a shot.
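
For intuition (this is a toy local simulation of the general idea, not the actual HDT-MR pipeline from [2]): building the HDT dictionary normally means holding every distinct RDF term in memory at once, whereas in a map/reduce setting each worker only ever holds its own hash partition, so the per-node memory requirement drops with the number of reducers. Roughly:

    from collections import defaultdict

    def map_phase(triples):
        # Map: emit every RDF term of every triple.
        for s, p, o in triples:
            yield s
            yield p
            yield o

    def partition(term, n_reducers):
        # Shuffle: hash-partition terms so each reducer sees a disjoint subset.
        return hash(term) % n_reducers

    def reduce_phase(terms):
        # Reduce: deduplicate and sort only this partition of the vocabulary.
        return sorted(set(terms))

    # Toy input; in reality this would be billions of PubChem triples on a cluster.
    triples = [("ex:a", "ex:p", "ex:b"), ("ex:b", "ex:p", "ex:c")]
    n_reducers = 4

    buckets = defaultdict(list)
    for term in map_phase(triples):
        buckets[partition(term, n_reducers)].append(term)

    # Each partial dictionary fits in a single worker's memory; a final merge
    # of the sorted partitions assigns the global IDs.
    partial_dicts = {r: reduce_phase(terms) for r, terms in buckets.items()}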