xml schema file (*.xsd) to rdf via python?


Donny Winston

Jan 26, 2022, 12:15:52 PM
to rdfli...@googlegroups.com
I want to translate https://schema.datacite.org/meta/kernel-4.4/metadata.xsd to RDF (ideally using OWL or SHACL vocabulary, I guess). I came across e.g. https://github.com/srdc/ontmalizer, but it's Java, and I haven't run Java in over a decade. I couldn't find anything obvious in the rdflib codebase. Any suggestions? Using rdflib, or really just using Python? :)

I'm not quite ready to shell out $3500 for TopBraid Composer (I am told it can do this), and I'd prefer an open source solution anyway.

Best,
Donny

--
Donny Winston, PhD (he/him/his) | Polyneme LLC

If I've emailed you, I'd love to speak with you.
Schedule a meeting (15min+): https://meet.polyneme.xyz

Graham Higgins

Jan 26, 2022, 2:42:00 PM
to rdflib-dev
Donny wrote:
I want to translate https://schema.datacite.org/meta/kernel-4.4/metadata.xsd to RDF (I guess using OWL or SHACL vocab ideally).
You may be in luck if this 2021 DataCite Ontology saves you the effort.

I couldn't find anything obvious in the rdflib codebase. Any suggestions? Using rdflib, or really just using python? :)
There isn't a Python implementation of XML->RDF, so there's nothing in the RDFLib codebase.

If the sparontologies implementation isn't any use to you, this Java implementation includes compiled jar files that bundle all the dependencies and can be run from the command line as described in the README, though I'm not sure how far it'll get you in your task.

Cheers,
Graham
 

Donny Winston

Jan 26, 2022, 3:44:38 PM
to JB
Thanks, Graham! Indeed, that sparontologies implementation should save a bunch of effort -- it is for v3.3 of the datacite schema rather than the latest v4.4, but it's a nice template.

Also, the allen501pc/XML2RDF Java implementation was usable from the command line as advertised! I was able to run
```bash
java -jar XML2RDF-0.2-jar-with-dependencies.jar -s datacite-schema-kernel-4.4.xsd -o output.rdf
```
and then
```python
from rdflib import Graph
g = Graph()
g.parse("output.rdf")
g.serialize(...)
```
the output is...a bit messy, but it's RDF and thus programmatically accessible to me for alteration/repair. Thank you!

Best,
Donny

Graham Higgins

Jan 26, 2022, 8:08:05 PM
to rdflib-dev
Donny wrote:
Thanks, Graham! Indeed, that sparontologies implementation should save a bunch of effort -- it is for v3.3 of the datacite schema rather than the latest v4.4, but it's a nice template.
I did wonder, but at least it'll save you some effort.
 
Also, the allen501pc/XML2RDF Java implementation was usable from the command line as advertised!
Great, good to know it worked for you as well.

the output is...a bit messy, but it's RDF and thus programmatically accessible to me for alteration/repair. Thank you!
Yes, that remains an issue. Doesn't even import into Protege.

I did find this paper describing a Python-implemented approach to automatic ontology generation from XSD. It references RDFLib, but there's no published code that I can find, which is a bit galling. I might well look into an implementation sometime.

 Cheers,
Graham

Graham Higgins

Jan 27, 2022, 11:29:49 AM
to rdflib-dev
On Thursday, January 27, 2022 at 1:08:05 AM UTC Graham Higgins wrote:
the output is...a bit messy, but it's RDF and thus programmatically accessible to me for alteration/repair. Thank you!
Yes, that remains an issue. Doesn't even import into Protege.

Further to this and to Donny's original question ... After some more extensive fossicking around, I found a few efforts pursuing an RDFLib XSD-to-OWL solution and some more Python utilities for handling XSD. I'll deal with the latter first ...

I found xsdflatten, which combined the split-into-several-files DataCite XSD spec into a single XSD file. generateDS then produced some useful-looking classes from that flattened file, and the result (an API in Python) does appear to be reasonably tractable.

I modified this tutorial example to:
> python generateDS.py -o datacite_api.py -s datacite_sub.py --super=datacite_api datacite.xsd
Looks like the "-s datacite_sub.py" is unnecessary in this case; the contents of that Python file don't seem useful for this specific task.

I also found this EOL'd PyXB ("a pure Python package that generates Python source code for classes that correspond to data structures defined by XMLSchema") and its Python3 fork PyXB-X, which is an alternative approach to processing XSD content. It worked, but without getting deeper into PyXB I can't say whether the results are *useful*.

As regards Python/RDFLib/XSD->OWL, there is a Python implementation using RDFLib contained in a gist, which is referenced in a further, partly-functional refinement, XSD2OWL by TB Huy, which also uses RDFLib.

In the end, I managed to get an approximation of a conversion working with RDFLib (clearly more work is required), using the flattened DataCite XSD and a handful of "get-it-working" changes to Pebbie's gist, which I forked.
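
For anyone wanting a feel for what an XSD->OWL mapping involves, here is a minimal standard-library sketch. It is not the code from the gist or the fork, just an illustration under simple assumptions: every named complexType becomes an owl:Class, and every named element becomes an owl:ObjectProperty if its type is complex and an owl:DatatypeProperty otherwise. The `BASE` namespace, the sample schema and both function names are made up for the example.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL = "http://www.w3.org/2002/07/owl#"
BASE = "http://example.org/datacite#"  # hypothetical namespace, an assumption

# Tiny made-up schema standing in for the flattened DataCite XSD.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="nameType">
    <xs:sequence>
      <xs:element name="givenName" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="creator" type="nameType"/>
  <xs:element name="title" type="xs:string"/>
</xs:schema>"""


def xsd_to_triples(xsd_text, base=BASE):
    """Return a set of (subject, predicate, object) IRI triples."""
    root = ET.fromstring(xsd_text)
    # Named complex types are mapped to OWL classes.
    complex_names = {
        ct.get("name") for ct in root.iter(XS + "complexType") if ct.get("name")
    }
    triples = {(base + n, RDF_TYPE, OWL + "Class") for n in complex_names}
    for el in root.iter(XS + "element"):
        name = el.get("name")
        if not name:
            continue  # skip <xs:element ref="..."/> references
        # Compare only the local part of the type name (prefix dropped).
        type_local = el.get("type", "").split(":")[-1]
        is_complex = (
            type_local in complex_names
            or el.find(XS + "complexType") is not None  # inline complex type
        )
        kind = "ObjectProperty" if is_complex else "DatatypeProperty"
        triples.add((base + name, RDF_TYPE, OWL + kind))
    return triples


def to_ntriples(triples):
    """Serialize IRI-only triples as N-Triples text, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in sorted(triples))


if __name__ == "__main__":
    print(to_ntriples(xsd_to_triples(SAMPLE_XSD)))
```

The N-Triples text can then be loaded into RDFLib with `Graph().parse(data=..., format="nt")` for further repair. A real conversion would also need to handle attributes, restrictions, cardinalities and `ref=` elements, which this sketch ignores.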

(fwiw, I'm viewing this as a potential "Cookbook" item)

Cheers,
Graham