Graph.parse with RDFXML parser performance issues

Brendan McMahon

Dec 22, 2021, 12:32:08 PM

to rdflib-dev

Dear rdflib contributors and maintainers,

I have recently been trying to update rdflib from version 4.2.2 to version 6. Since the update, a process I normally run, which uses rdflib to load a large RDF/XML file into a graph, has a significantly larger memory footprint and higher latency (for my large file, parsing takes about 1.5x as long).

I've traced the issue back to the graph.parse method. More specifically, by profiling graph.parse with versions 6.1.1 and 4.2.2, I can see that accesses to members of the RDF class (mostly in the node_element_start and property_element_start methods) take significantly longer, as the class is now a DefinedNamespace with overridden __getitem__ and __contains__ methods with added string checks.
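
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of profiling a single call with the standard library's cProfile. The helper name profile_call and the file name in the usage comment are placeholders chosen for illustration, not part of rdflib; the idea is to run the same script under each rdflib version and compare the reports.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical usage with rdflib ("large.rdf" is a placeholder path):
#
#     from rdflib import Graph
#     _, report = profile_call(Graph().parse, "large.rdf", format="xml")
#     print(report)
```

Sorting by cumulative time tends to surface the parser's handler methods near the top, which makes it easier to see whether namespace lookups account for the difference between versions.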

Has anyone else experienced this issue? I have been trying to find ways to work around/with the library to lower the latency, but haven't been able to find anything yet.

Thanks,

Brendan

Nicholas Car

Jan 7, 2022, 6:38:43 AM

to rdfli...@googlegroups.com

Hi Brendan,

This is an interesting issue! No, I haven't encountered it, but then I never use large RDF/XML graphs. How large is your graph, by the way?

If you really think the issue is the getting or testing of elements in the RDF DefinedNamespace, couldn't you just clone rdfxml.py and replace all references to the RDF DefinedNamespace with references to a hard-coded set of URIRefs? You could try using that in place of the current rdfxml.py and see if there is a speedup. The file's only ~600 lines long, so a find 'n replace shouldn't be too hard.
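
The idea can be sketched without rdflib at all. The toy below is an assumption-laden stand-in, not rdflib's actual DefinedNamespace code: ToyRDF resolves every term through a metaclass __getattr__ that runs string checks, while RDF_TYPE is the same value resolved once and kept as a module-level constant. All names here (CheckedNamespaceMeta, ToyRDF, count_types_dynamic, count_types_hoisted, compare) are hypothetical.

```python
import timeit

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


class CheckedNamespaceMeta(type):
    """Toy metaclass (an assumption, not rdflib's code): every term lookup
    runs Python-level string checks before building the result."""

    _terms = frozenset({"type", "Description", "about", "resource"})

    def __getattr__(cls, name):
        # Only reached when normal attribute lookup fails, i.e. for every term.
        if name.startswith("_"):
            raise AttributeError(name)
        if name not in cls._terms:
            raise AttributeError(f"{name!r} is not a term in this namespace")
        return RDF_NS + name


class ToyRDF(metaclass=CheckedNamespaceMeta):
    """Stand-in for a namespace class whose terms are resolved dynamically."""


# Hoisted constants: resolve each term once, at import time.
RDF_TYPE = ToyRDF.type
RDF_ABOUT = ToyRDF.about


def count_types_dynamic(names):
    # Looks the term up through the metaclass on every iteration.
    return sum(1 for name in names if name == ToyRDF.type)


def count_types_hoisted(names):
    # Compares against the precomputed constant instead.
    return sum(1 for name in names if name == RDF_TYPE)


def compare(repeats=5):
    """Return (dynamic_seconds, hoisted_seconds), best of `repeats` runs."""
    names = [RDF_NS + "type", RDF_NS + "about"] * 10_000
    dynamic = min(
        timeit.repeat(lambda: count_types_dynamic(names), number=1, repeat=repeats)
    )
    hoisted = min(
        timeit.repeat(lambda: count_types_hoisted(names), number=1, repeat=repeats)
    )
    return dynamic, hoisted


if __name__ == "__main__":
    dynamic, hoisted = compare()
    print(f"dynamic lookup: {dynamic:.4f}s  hoisted constant: {hoisted:.4f}s")
```

In a cloned rdfxml.py the equivalent change would be to define constants such as URIRef("http://www.w3.org/1999/02/22-rdf-syntax-ns#type") once at module level and use those in the handler methods. Whether that recovers the 4.2.2 timings is something only a measurement on the real file can show.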

I would love to know how you go with this, if you try it. If it overcomes the problem, we may consider doing such a replacement within internal RDFLib files to improve performance and then providing the DefinedNamespaces for external use only, i.e. when people build RDFLib graphs with g.add() and use FOAF.givenName to represent URIs.

Cheers,

Nick

--
http://github.com/RDFLib
---
You received this message because you are subscribed to the Google Groups "rdflib-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rdflib-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rdflib-dev/1f616ab0-187b-4af0-aca3-13d436c6dd71n%40googlegroups.com.

Brendan McMahon

Jan 7, 2022, 9:58:05 AM

to rdflib-dev

Hi Nick, thanks for the response! Yes, the idea you mention is what I was considering trying next, but I thought I'd ask in here to see if there were any other ideas about handling this with what the library has built in. I will report back here with what I do if it works out! Also, the file is about half a gig. It takes ~25 minutes to parse.

Thanks,

Brendan

Wes Turner

Jan 7, 2022, 2:06:45 PM

to rdfli...@googlegroups.com

Out of curiosity, does performance differ with defusedxml in there?

(RDF)XML parser complexity really is unnecessary compared to e.g. N3, JSONLD, or RDFHDT.

Does performance differ after transforming to a non-XML format?

defusedxml should probably be an install_requires dependency because of the XML parser vulnerabilities that it patches:

https://pypi.org/project/defusedxml/

Nicholas Car

Jan 7, 2022, 8:55:29 PM

to rdfli...@googlegroups.com

I guess it depends on how you are producing the RDF/XML file in the first place. If you do have control over that, and loading times are really an issue, produce an N-Triples file, as this will load the fastest!
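
If the producer of the file can't be changed, a one-off conversion is another option. The sketch below assumes rdflib is installed and that paying the slow RDF/XML parse once is acceptable; the function name rdfxml_to_ntriples and the file names are placeholders.

```python
def rdfxml_to_ntriples(src_path, dst_path):
    """Parse an RDF/XML file once and write it back out as N-Triples.

    Returns the number of triples in the parsed graph.
    """
    # Imported here so the module still loads where rdflib is not installed.
    from rdflib import Graph

    graph = Graph()
    graph.parse(src_path, format="xml")  # the slow step, paid once
    graph.serialize(destination=dst_path, format="nt")
    return len(graph)


# Hypothetical usage (paths are placeholders):
#
#     rdfxml_to_ntriples("large.rdf", "large.nt")
#     Graph().parse("large.nt", format="nt")   # later loads use the .nt file
```

Timing the later N-Triples load against the original RDF/XML load would also answer the question of whether performance differs in a non-XML format.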

Wes Turner

Jan 7, 2022, 8:58:41 PM

to rdfli...@googlegroups.com

RDFHDT is fast for *reads*; probably faster than N-Triples: https://github.com/RDFLib/rdflib-hdt

Nicholas Car

Jan 7, 2022, 10:04:16 PM

to rdfli...@googlegroups.com

True, but producing an HDT file will likely require a lot more effort than an N-Triples file. Brendan, what's the thing generating the RDF/XML file in the first place?
