Parallelism, concurrency and blocking/batching

Frens Jan Rumph

Jan 15, 2021, 8:46:57 AM

to RDF4J Users

Hi!

I'm evaluating the use of RDF4J and RDF in general. I have a number of use cases, some of which match very well with the SAIL API, the triple stores, etc. But I'm also investigating whether I can use RDF4J to orchestrate information retrieval based on SPARQL across disparate sources.

There are a couple of challenges here, one being that these sources are slow / expensive. Therefore I'm wondering whether there have been any endeavours into parallelism, concurrency and blocking/batching within the query algebra evaluation? I reckon that this would also be useful for 'normal' triple stores that have considerable round trip times (see also some work in Halyard).

Is this an area of interest within the RDF4J community? Are there any thoughts on e.g. using RxJava or other 'platforms' to provide the basics for non-blocking flows instead of using the Volcano model?

Interested in your views on the matter.

Best regards,
Frens Jan

hmott...@gmail.com

Jan 15, 2021, 9:39:29 AM

to RDF4J Users

We have FedX for querying external sources in a distributed way with higher performance.

Håvard

Frens Jan Rumph

Jan 15, 2021, 10:38:06 AM

to RDF4J Users

Thanks for your swift response Håvard,

I've looked into FedX (and the original federation implementation) and into whether I'd be able to use these. Perhaps to clarify my context a little: my aim is to combine multiple sources that aren't SPARQL endpoints themselves and aren't backed by a triple store or any similar backend that allows querying for arbitrary statements.

The sources in our application are currently orchestrated with handwritten Java code into predefined information analysis use cases. We operate in the intelligence / security domain, so the RDF model is a very clear fit. We have a forward chaining application that indexes source data and inferred statements. However, we have a complementary product line that is more of a backward chaining / query type of application. So essentially, we would like to annotate application code (Java components) that acts as a sort of 'transformation' performing lookups into remote sources (not just on single values, but on almost arbitrary graph patterns). SPARQL comes in from the query side to essentially declare the orchestration.

I have considered wrapping the aforementioned application code (Java components) in FedX endpoints. There are some peculiarities with the ASK based probing that FedX performs for source selection, but I could probably work with that. But as the application components to be 'orchestrated' / 'federated' apply transformations (joins if you will, typically on (virtual) indices of composite keys), I think the best route would be to introduce custom optimisers that replace / add operators that add these 'transformations' to the query plan.

The whole idea very much has a federation ring to it, but the fact that we need quite fine grained control over the 'join patterns' is probably blocking. The block nested loop approach that FedX uses to perform joins could provide some inspiration for me here.
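
To make the block nested loop idea concrete, here is a minimal sketch in plain Java. It is not FedX code and uses no RDF4J API: `BatchSource`, `join` and the block size are hypothetical names chosen for illustration. The point it shows is that the left-hand keys are sent to the slow source in blocks, so the number of round trips is roughly the number of keys divided by the block size rather than one per key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockNestedLoopJoin {

    /** Hypothetical remote source that answers one batched lookup for many keys at once. */
    interface BatchSource {
        Map<String, List<String>> lookup(List<String> keys);
    }

    /** Joins the left keys against the source, sending the keys in blocks of blockSize. */
    static List<String[]> join(List<String> leftKeys, BatchSource source, int blockSize) {
        List<String[]> results = new ArrayList<>();
        for (int start = 0; start < leftKeys.size(); start += blockSize) {
            int end = Math.min(start + blockSize, leftKeys.size());
            List<String> block = leftKeys.subList(start, end);
            // One round trip per block instead of one per key.
            Map<String, List<String>> matches = source.lookup(block);
            for (String key : block) {
                for (String value : matches.getOrDefault(key, List.of())) {
                    results.add(new String[] { key, value });
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        int[] roundTrips = { 0 };
        // Stand-in source: returns a single made-up value per key and counts calls.
        BatchSource source = keys -> {
            roundTrips[0]++;
            Map<String, List<String>> answer = new HashMap<>();
            for (String key : keys) {
                answer.put(key, List.of(key + "-name"));
            }
            return answer;
        };
        List<String[]> joined = join(List.of("a", "b", "c", "d", "e"), source, 2);
        // Prints "5 results in 3 round trips".
        System.out.println(joined.size() + " results in " + roundTrips[0] + " round trips");
    }
}
```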

Sorry for the long write up. And I realise that even still it's rather abstract and out of the ordinary.

Hope you have some thoughts on the matter!

Best regards,

Frens Jan

hmott...@gmail.com

Jan 16, 2021, 9:42:16 AM

to RDF4J Users

You may want to consider a commercial provider. A lot of today’s graph databases build on Ontop (the federated query and reasoning solution built with money from EU Horizon 2020). A few also offer query level annotation to specify join order.

An option would be to contact Andreas Schwarte who is the main developer of FedX, maybe you can hire him through Metaphacts to improve FedX to meet your needs.

Håvard

hmott...@gmail.com

Jan 16, 2021, 9:55:51 AM

to RDF4J Users

One more thing.

The ExtensibleStore might be worth looking at if you want to expose an underlying data source as RDF. It doesn’t help with the actual mapping aspect, but it does make it much easier to build something that supports SPARQL. Take a look at the simple test implementation in the tests directory.

https://github.com/eclipse/rdf4j/tree/master/core/sail/extensible-store

And if your data is XML based I would recommend a tool I wrote a few years ago (which is in use in production systems today).

https://github.com/AcandoNorway/XmlToRdf

Håvard

Andreas Schwarte

Jan 19, 2021, 2:41:08 AM

to rdf4j...@googlegroups.com

Hi Frens Jan,

your use-cases and ideas sound interesting. As Håvard has pointed out, federation technologies together with a store that lifts your data to RDF might be a good fit for your use cases.

I am the main developer of FedX and would be interested in getting some more details. Also in this context I'd like to mention that at metaphacts (the company I am working for) we have a product called metaphactory, which incorporates a federation engine for hybrid information needs (supporting, for instance, the integration of information from RESTful services or relational databases with the main RDF database).

If you are interested in having a chat with me, feel free to reach out and we can arrange something.

Best,
Andreas

Frens Jan Rumph

Jan 19, 2021, 1:35:02 PM

to RDF4J Users

Hello Håvard, Andreas,

Thank you for your responses and "thinking along with me" (pardon the transliteration ;)).

I guess the work I'd like to do is really in the Ontology Based Data Access area. I'm trying to get comfortable with the lingo here, so bear with me. If only our sources were as simple as databases or APIs. We're in the open source intelligence space. So the data access that is currently orchestrated with 'workflows' in Java has a tremendous variety. There are some fairly structured sources that we target, but there are a great many sources that are natural language, so we employ NLP and heuristics to extract information from them. So it isn't a matter of SQL/JDBC, Swagger/OpenAPI, JSON Schema or whatever to RDF mapping.

My main interest now is to prototype an integration based on the strict evaluation strategy. I already have such a prototype based on Cypher (the graph query language from Neo4j). So I don't see any major functional challenges.

On the performance side, I don't think the Volcano/iterator setup will fly. Our current application relies heavily on concurrency in a tree-wise scatter-gather approach. We really need to put multiple lookups to the sources to work at the same time in order to meet latency objectives.
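
As an illustration of the scatter-gather pattern described above, a minimal sketch using only the JDK's `CompletableFuture`. The class and method names, the stand-in lookup and the pool size are assumptions made for the example, not part of RDF4J or of our application. All lookups are submitted first and only then awaited, so the total wall-clock time is bounded by the slowest lookup rather than the sum of all lookups (given enough threads).

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ScatterGather {

    /** Scatters one asynchronous lookup per key, then gathers the results in input order. */
    static <K, V> List<V> scatterGather(List<K> keys, Function<K, V> lookup, ExecutorService pool) {
        // Scatter: submit every lookup before waiting on any of them.
        List<CompletableFuture<V>> pending = keys.stream()
                .map(key -> CompletableFuture.supplyAsync(() -> lookup.apply(key), pool))
                .collect(Collectors.toList());
        // Gather: block until each lookup has completed.
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is arbitrary here
        try {
            // Stand-in for a slow remote lookup.
            List<String> profiles = scatterGather(List.of("alice", "bob"), key -> "profile:" + key, pool);
            System.out.println(profiles); // prints [profile:alice, profile:bob]
        } finally {
            pool.shutdown();
        }
    }
}
```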

Our workflows are currently tree-structured (call stacks essentially) with the top of the trees typically doing local stuff (information extraction, linking, etc.) and towards the leaves there is more IO work. So I could probably make a substantial improvement with some well placed custom operators. Read-aheads are important and fairly easy to integrate. And some join patterns require a number of tuples in the same group (e.g. give me all names and locations around an entity to perform lookups, entity linking/matching, etc.).
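
A read-ahead can be sketched as a wrapper around an ordinary iterator, again in plain Java with hypothetical names (`ReadAheadIterator`, `readAhead`) and no RDF4J types. A background thread pulls from the slow source into a bounded buffer while the consumer processes earlier elements. The sketch assumes the source never yields `null` and it does not support cancellation.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReadAheadIterator<T> implements Iterator<T> {

    private static final Object END = new Object(); // sentinel marking exhaustion of the source

    private final BlockingQueue<Object> buffer;
    private volatile RuntimeException failure; // set if the source iterator throws
    private Object next;                       // element taken from the buffer but not yet returned

    public ReadAheadIterator(Iterator<T> source, int readAhead) {
        this.buffer = new ArrayBlockingQueue<>(readAhead);
        Thread producer = new Thread(() -> {
            try {
                try {
                    while (source.hasNext()) {
                        buffer.put(source.next()); // blocks once readAhead elements are buffered
                    }
                } catch (RuntimeException e) {
                    failure = e; // surfaced to the consumer after the buffered elements
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = buffer.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        if (next == END && failure != null) {
            throw failure;
        }
        return next != END;
    }

    @SuppressWarnings("unchecked")
    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T value = (T) next;
        next = null;
        return value;
    }

    public static void main(String[] args) {
        Iterator<String> it = new ReadAheadIterator<>(List.of("x", "y", "z").iterator(), 2);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```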

That said, I can imagine that my fairly specific requirements would greatly benefit from an RxJava type of push+pull flow instead of a single threaded iterator model. Would such work be of interest to the RDF4J community? As indicated earlier, this would probably be of use for triple stores with a fair amount of latency (as with elasticsearch-store, Halyard, etc.). Or is/are the evaluation strategy/strategies considered more of a reference implementation and is anything beyond that left for commercial suppliers?

From this write-up it probably is clear that I'm really investigating the area. @Andreas, a chat sounds interesting. I'll reach out on LinkedIn.