External SPARQL Endpoints


Matt Goldberg
Aug 20, 2020, 9:42:04 AM
to TopBraid Suite Users
I've been experimenting with importing SPARQL endpoints via Import > Create Connection File For SPARQL Endpoint. This works great for smaller datasets, and having the endpoint wrapped as a virtual graph is a great feature I'd like to take advantage of. However, since it tries to cache all triples at that SPARQL endpoint, it is not practical for large datasets (e.g., DBpedia). Is there a way to configure a virtual graph for a SPARQL endpoint that does not try to cache the contents of the remote store?

Irene Polikoff
Aug 20, 2020, 9:46:12 AM
to topbrai...@googlegroups.com
What exactly are you trying to accomplish?

You can use the SERVICE keyword in SPARQL queries without having a connection file.
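For example, something along these lines runs against DBpedia's public endpoint with no connection file and no local caching (a sketch only; the resource and property IRIs are just illustrative):

```sparql
# Query DBpedia's public endpoint directly via SERVICE --
# nothing is cached locally, results come back per query.
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?abstract
WHERE {
  SERVICE <https://dbpedia.org/sparql> {
    dbr:SPARQL dbo:abstract ?abstract .
    FILTER (lang(?abstract) = "en")
  }
}
```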


--
You received this message because you are subscribed to the Google Groups "TopBraid Suite Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to topbraid-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/topbraid-users/9749e6cc-0534-4934-8f39-d7ca8fe675a3n%40googlegroups.com.

Matt Goldberg
Aug 20, 2020, 10:16:14 AM
to TopBraid Suite Users
Right, I know the SERVICE keyword does the trick, and that may be sufficient.

There are a couple of paths I'm trying to explore:
  • We've been told that large, dynamic datasets would be better kept in another triple store, and we're looking at AllegroGraph (AG) as a possibility for that. It would be nice to have a data graph in AG appear as a graph in EDG, since the vocabularies used in the AG data graphs will be managed by EDG, and we'd like EDG web services and SHACL validators to be able to access the AG data easily.
  • It would be convenient if there were a way to create a graph that imports several virtual graphs, or connects to several external SPARQL endpoints, so that queries can be federated across multiple graphs at once and users don't have to know that the data comes from different sources. They would then have a single place to access all the data they need, without having to know which SPARQL endpoints exist.

Irene Polikoff
Aug 20, 2020, 11:40:41 AM
to topbrai...@googlegroups.com
Hi Matt,

Please see below

On Aug 20, 2020, at 10:14 AM, Matt Goldberg <mgbe...@gmail.com> wrote:

Right, I know the SERVICE keyword does the trick, and that may be sufficient.

There are a couple of paths I'm trying to explore:
  • We've been told that large, dynamic datasets would be better kept in another triple store, and we're looking at AllegroGraph as a possibility for that.
Did someone at TopQuadrant give you this advice?

  • It would be nice to have a data graph in AG appear as a graph in EDG
This is not possible. Data is either in EDG or it is not in EDG.

  • as the vocabularies that would be used in the AG data graphs will be managed by EDG and we'd like EDG web services and SHACL validators to be able to easily access the AG data.
There are options for selectively copying some data which would then be available for services and validation. Copied data can be periodically refreshed.

One option to consider is described here: https://www.topquadrant.com/technology/shacl/wikidata/

Note that the screenshots are from 6.2. Hopefully, the instructions are still easy enough to follow in 6.4. We will update the screenshots shortly.

Further, the example describes a connection to Wikidata. There are some conveniences built into EDG for linking with Wikidata. You can, however, do the same with other SPARQL endpoints. You will not get the auto-suggestions for resource links or the auto-copying of shapes; you will need to set these up yourself. Once that is done, fetching, copying, and access to the property values of the linked remote resources work exactly the same way.

  • It would be convenient if there were a way to create a graph that imports several virtual graphs, or connects to several external SPARQL endpoints, so that queries can be federated across multiple graphs at once and users don't have to know that the data comes from different sources. They would then have a single place to access all the data they need, without having to know which SPARQL endpoints exist.
If you have multiple SPARQL Endpoints, create a link property for each endpoint and provide links to the corresponding remote resources from all the endpoints. Also create separate Node Shapes representing data of interest from each endpoint.

You would then, for example, be able to fetch “height” from one endpoint and “weight” from another.
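In plain SPARQL terms, the effect is similar to one query with a SERVICE clause per endpoint (a sketch only; the endpoint URLs and ex: property IRIs below are placeholders, not actual EDG configuration):

```sparql
# Sketch: pull "height" from one endpoint and "weight" from another
# in a single federated query. All IRIs here are placeholders.
PREFIX ex: <http://example.org/ns#>

SELECT ?person ?height ?weight
WHERE {
  SERVICE <http://endpoint-a.example.org/sparql> {
    ?person ex:height ?height .
  }
  SERVICE <http://endpoint-b.example.org/sparql> {
    ?person ex:weight ?weight .
  }
}
```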


Matt Goldberg
Aug 20, 2020, 12:11:28 PM
to TopBraid Suite Users
Great, I'll have to give this a try.

And yes, someone at TopQuadrant told us that very large, dynamic datasets would be better kept in another triple store.

Irene Polikoff
Aug 20, 2020, 1:31:05 PM
to topbrai...@googlegroups.com
OK, let us know how it goes.

Btw, no one at TopQuadrant recalls giving you this advice.

If you have very large amounts of operational data in RDF (like 10 to the 12th triples), this would make sense. However, in that case, you would not try to make this data available in EDG. If this data uses controlled vocabularies/reference data managed by EDG, then the typical workflow would be:

1. Curate reference data in EDG, i.e., EDG is the definitive source of reference data.
2. Deliver it from EDG to the other environments where it is used.
3. When reference data is updated in EDG, the external systems get updated as well.

Thus, the flow would be from EDG to the external systems. This is a standard scenario for the “master reference data” solution. 

The scenario of going from external sources to EDG would be limited to reference data discovery, i.e., the initial setup step when you are first establishing your reference datasets. If you already have various sources that use the controlled values, you would want to start by importing them into EDG, where you would put them under management.

Regards,

Irene

Fan Li
Aug 21, 2020, 7:59:39 AM
to TopBraid Suite Users
Thanks for the advice, Irene. Using EDG to provide data governance to existing operational databases/triple stores is relevant to us as well.

You mentioned that for "very large (like 10 to the 12th) amounts of operational data, ... going from the external sources to EDG would be limited to the reference data discovery". Are you implying there are more options if we have fewer than 1 billion triples?