RDF construction using python / rdflib?


Miel Vander Sande

Apr 21, 2021, 8:16:06 AM
to rdflib-dev

Hi all,

I'm not sure whether this is the right place for this question, but AFAIK the Python RDF community does not have a general community mailing list the way RDF.js does?

I was wondering whether there are any libraries / efforts that use RDFLib to create ETL pipelines for constructing RDF from various sources? I could definitely use something like that, but couldn't really find anything yet. The RML-based tools don't really work that well for my use cases (JSON records), and they lack some transparency for debugging / interop with other libraries when producing triples.

I had already started to think about a possible API and how it could leverage Dask or Spark to really scale up. I'm not a Python/data engineering expert, so this might come across as naive.

```
Mapping()               # lazy execution pipeline object
  .load("file1.json")   # creates a graph from a direct JSON mapping
  .construct(query1)    # creates a new graph containing a mapping from the file1.json graph
  .construct(query2)    # creates a new graph containing a mapping from the file1.json graph
  .load("file2.json")   # creates a graph from a direct JSON mapping
  .construct(query3)    # creates a new graph containing a mapping from the file2.json graph
  .collect()            # aggregates all constructed graphs into one
  .check(shacl)         # validates the constructed graph against a SHACL shape
  .run()                # actually runs the pipeline
```

Best,

Miel

Edmond Chuc

Apr 21, 2021, 8:23:14 PM
to rdfli...@googlegroups.com
Hi Miel,

You can definitely create ETL pipelines with Apache Spark (using PySpark) and RDFLib. Read the JSON records into a Spark dataframe and create triples in a graph with RDFLib. This is how we produce triples from relational databases with Spark at my workplace. Don't forget to repartition the dataframe into multiple chunks so that chunks of the same dataframe are processed in parallel.
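
Not Edmond's actual code, but a minimal sketch of the pattern he describes, assuming a hypothetical input file records.json with "id" and "name" fields and a placeholder namespace:

```
# Sketch only: read JSON records into a Spark dataframe, repartition so the
# work is split across workers, and build triples per partition with RDFLib.
from pyspark.sql import SparkSession
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # placeholder namespace

def partition_to_ntriples(rows):
    # Build an in-memory graph for one partition and yield it as N-Triples.
    g = Graph()
    for row in rows:
        g.add((EX[str(row["id"])], EX["name"], Literal(row["name"])))
    yield g.serialize(format="nt")  # str in rdflib 6+, bytes in rdflib 5.x

spark = SparkSession.builder.appName("json-to-rdf").getOrCreate()
df = spark.read.json("records.json")  # hypothetical input file
chunks = df.repartition(8).rdd.mapPartitions(partition_to_ntriples).collect()
```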

Cheers,

Edmond


Miel Vander Sande

Apr 22, 2021, 4:33:39 AM
to rdfli...@googlegroups.com
Hi Edmond,

Great to hear! Maybe some follow-up questions:
- What do you use to perform the mapping? The RDFLib API, or SPARQL CONSTRUCT somehow?
- What's in your relational databases? Do they contain embedded JSON (that's what we have) or plain tables?
- Do you use a PySpark schema?

Best,

Miel

On Thu, 22 Apr 2021 at 02:23, Edmond Chuc <edmon...@gmail.com> wrote:

Edmond Chuc

Apr 22, 2021, 6:36:04 AM
to rdfli...@googlegroups.com
Hi Miel,

We run our Spark jobs as an embarrassingly parallel workflow. Each worker processes a partition of a dataframe with the RDFLib API to create the statements in memory. When a worker finishes processing its partition, it sends the payload over HTTP to a remote triplestore as a file. These files are then bulk-loaded into the triplestore.
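
A rough sketch of that per-partition step, under assumed names (the endpoint URL, input file, and column names are illustrative, and the actual bulk-load mechanism depends on the triplestore):

```
# Sketch only: each partition builds its own RDFLib graph, serializes it,
# and POSTs the result to the store over HTTP.
import requests
from pyspark.sql import SparkSession
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # placeholder namespace

def upload_partition(rows):
    g = Graph()
    for row in rows:
        g.add((EX[str(row["id"])], EX["name"], Literal(row["name"])))
    payload = g.serialize(format="turtle")
    if isinstance(payload, str):  # rdflib 6+ returns str, 5.x returns bytes
        payload = payload.encode("utf-8")
    requests.post(
        "http://triplestore.example.org/data",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "text/turtle"},
    )

spark = SparkSession.builder.appName("rdf-load").getOrCreate()
df = spark.read.json("records.json").repartition(8)  # hypothetical input
df.rdd.foreachPartition(upload_partition)
```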

It will be interesting to see whether there's any performance improvement or cost in using SPARQL Update through the RDFLib SPARQLUpdateStore.
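
For comparison, a graph backed by SPARQLUpdateStore would push statements to the endpoint as SPARQL updates instead of bulk-loaded files (the endpoint URLs and terms below are placeholders):

```
# Sketch only: an RDFLib graph backed by a remote endpoint, so add() issues
# SPARQL INSERT DATA updates over HTTP.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

EX = Namespace("http://example.org/")  # placeholder namespace

store = SPARQLUpdateStore(
    "http://triplestore.example.org/sparql",  # query endpoint (placeholder)
    "http://triplestore.example.org/update",  # update endpoint (placeholder)
)
g = Graph(store, identifier=URIRef("http://example.org/graph"))
g.add((EX["s"], EX["p"], Literal("o")))  # sent to the endpoint on add (autocommit)
```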

Our RDBMS contains just plain tables, no JSON objects, so it's probably a bit easier for us to work with Spark dataframes compared to your case with the JSON files.

From memory, I think we let Spark infer the schema. If there are any problems with the inferred types, then we explicitly state them with PySpark's schema.
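
For example, an explicit schema (column names here are made up) would look like this instead of relying on inference:

```
# Sketch only: pin column types explicitly instead of letting Spark infer them.
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("year", IntegerType(), nullable=True),
])

spark = SparkSession.builder.appName("schema-example").getOrCreate()
df = spark.read.schema(schema).json("records.json")  # hypothetical input file
```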

Hope this helps.

Cheers,

Edmond
