Interested in speeding up RDFlib SPARQL?


Nicholas Car

Oct 31, 2020, 2:32:56 AM
to rdfli...@googlegroups.com
Hi all,

The slowness of executing a SPARQL query against an RDFlib Graph annoys me, so I often use loops over the graph instead; but this can lead to verbose Python programming that isn't translatable to triplestores with SPARQL interfaces, other programming languages, etc.
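For concreteness, here is a minimal sketch of the two styles (the FOAF data and file name are hypothetical):

    # Hypothetical FOAF data; the file name is made up for illustration.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    g = Graph().parse("people.ttl", format="turtle")

    # SPARQL: portable to any triplestore, but slow in RDFlib
    q = """
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?name WHERE { ?p a foaf:Person ; foaf:name ?name . }
    """
    names_sparql = [str(row.name) for row in g.query(q)]

    # Triple-pattern loops: fast in-process, but RDFlib-specific
    names_loop = [
        str(name)
        for person in g.subjects(RDF.type, FOAF.Person)
        for name in g.objects(person, FOAF.name)
    ]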

I would love to see RDFlib's graph.query function sped up so I could just use SPARQL queries!

Is anyone else on this list interested in this, and does anyone have experience profiling Python to find out where the slowness occurs? If we knew where the time goes, we might be able to find a way to speed things up.

Thanks,

Nick

--

______________________________________________________________________________________
kind regards
Dr Nicholas Car
Data Systems Architect at SURROUND Australia Pty Ltd
Address  P.O. Box 86, Mawson, Canberra ACT 2607
Phone     +61 477 560 177 
Email       nichol...@surroundaustralia.com
Website   https://www.surroundaustralia.com

Enhancing Intelligence Within Organisations

delivering evidence that connects decisions to outcomes



Dr Nicholas Car
Adj. Senior Lecturer

Research School of Computer Science

The Australian National University
Canberra ACT Australia


https://orcid.org/0000-0002-8742-7730

https://cs.anu.edu.au/people/nicholas-car 

Wes Turner

Oct 31, 2020, 11:46:35 AM
to rdfli...@googlegroups.com
SnakeViz is one tool for visualizing the output of cProfile.

You can call cProfile in IPython with the %prun magic command.
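As a hedged sketch (the query and data file are hypothetical), profiling a query and feeding the stats to SnakeViz might look like:

    import cProfile
    import pstats

    from rdflib import Graph

    g = Graph().parse("dataset.ttl", format="turtle")  # hypothetical dataset

    # Profile one query; list() forces the lazy result iterator to run.
    cProfile.run(
        'list(g.query("SELECT * WHERE { ?s ?p ?o } LIMIT 10"))',
        "query.prof",
    )

    # Inspect in the terminal ...
    pstats.Stats("query.prof").sort_stats("cumulative").print_stats(20)
    # ... or visualize: pip install snakeviz && snakeviz query.prof

In IPython, the equivalent one-liner is %prun list(g.query(...)).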




Note that the %timeit magic command runs the code many times in order to average out transient sources of variance in the timing.
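For example, in an IPython session with a Graph g already loaded (the queries here are arbitrary):

    %timeit list(g.query("SELECT ?s WHERE { ?s ?p ?o } LIMIT 100"))

    from itertools import islice
    %timeit list(islice(g.triples((None, None, None)), 100))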

Sysdig can also monitor application performance with Tracers, alongside whatever other processes are running on the system under test.



https://pymotw.com/2/profile/ "profile, cProfile, and pstats – Performance analysis of Python programs."


Boris Pelakh

Oct 31, 2020, 4:31:56 PM
to rdfli...@googlegroups.com
I don't have that much experience with Python profiling, but I can take a shot at getting some measurements. Is there a particular benchmark (LUBM? BSBM?) that would make a good starting point?

Nicholas Car

Nov 2, 2020, 6:22:05 AM
to rdfli...@googlegroups.com
I think tests against benchmarking datasets like LUBM should come second. First, general Python timing tests like Wes mentioned should be carried out. This is because I suspect the speed issues may be down to whole chunks of Python code executing even before the actual SPARQL evaluation (perhaps the query parsing rather than the execution, etc.), but that needs to be established, and it's what I think is needed initially. After that, LUBM indeed: we need to know how RDFlib compares to triplestores etc.
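A rough sketch of how that parse/execute split could be measured (the dataset and query are hypothetical; prepareQuery does the parsing and algebra translation up front):

    import time

    from rdflib import Graph
    from rdflib.plugins.sparql import prepareQuery

    g = Graph().parse("dataset.ttl", format="turtle")
    q = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"

    t0 = time.perf_counter()
    prepared = prepareQuery(q)         # parsing + algebra translation
    t1 = time.perf_counter()
    results = list(g.query(prepared))  # execution only
    t2 = time.perf_counter()

    print(f"parse/translate: {t1 - t0:.4f}s, execute: {t2 - t1:.4f}s")

(Note that importing rdflib.plugins.sparql already pays pyparsing's one-off grammar-construction cost, so that would need to be timed separately if startup time matters.)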

I will try to look into using a tool like SnakeViz.

Thanks for the tips!

Jörn Hees

Nov 2, 2020, 6:56:41 AM
to rdflib-dev
I vaguely remember that Gunnar left some issues back in the day with ideas for SPARQL optimizations (e.g., a "most constrained first" heuristic), so it might be worth digging those up before jumping into anything else...
IIRC the original code was tailored more towards simplicity and correctness than speed.

Be warned that parsing and turning the result into an algebra that is then executed causes a lot of indirection, making the SPARQL part quite difficult to debug (and hence probably to profile)...
I remember that it's easy to get lost in there for hours, especially if you don't understand the big picture of that code...
So sure, profiling the code is always a good idea, but maybe don't start with the test cases that get us close to (or above) the stack depth limit anyhow ^^

j

Gunnar Aastrand Grimnes

Nov 2, 2020, 9:52:51 AM
to rdfli...@googlegroups.com
I would say there are roughly three reasons why the SPARQL engine is slow:

* Pyparsing is incredibly slow: in particular, constructing the grammar is slow (so startup time is bad), and the parsing itself is slow. You can pre-parse your queries (see the sketch just after this list). But really, pyparsing is kinda long in the tooth - I would recommend replacing it with https://github.com/erikrose/parsimonious

* We have a shit query planner :D We have this: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/algebra.py#L97
A real triplestore will keep track of at least the number of usages of each predicate, so it can evaluate the least frequent patterns first. We only order by the number of bound components (see the toy sketch at the end of this message).
You can also do some rewriting of the algebra, to do less stupid things - we don't do any of that.

* Python is really slow :D Numpy is fast because all the inner code is in C or Fortran and all the data is super strongly typed. We throw weakly typed objects around in dicts all the time, and that just takes time. A native Python SPARQL engine will never rival Jena/Fuseki.
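To illustrate the pre-parsing point from the first bullet, a sketch (data and bindings are hypothetical): parse once with prepareQuery, then re-execute with different initial bindings, so only execution cost is paid per call.

    from rdflib import Graph
    from rdflib.plugins.sparql import prepareQuery

    g = Graph().parse("dataset.ttl", format="turtle")  # hypothetical

    # Parse and translate to algebra exactly once
    q = prepareQuery("SELECT ?o WHERE { ?s ?p ?o }")

    # Re-run the already-parsed query with different bindings
    for s in set(g.subjects()):
        rows = list(g.query(q, initBindings={"s": s}))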

There are probably some things you can do differently in the various processing methods - but in the end, most of it is simply looping over triples. If you loop in the wrong order and loop over 5M things instead of 5, no amount of code trickery will help.
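A toy illustration of that point (not RDFlib's actual planner; hypothetical FOAF data assumed): the same join written in two orders.

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    g = Graph().parse("dataset.ttl", format="turtle")  # hypothetical

    # Bad join order: touch every foaf:name triple in the graph,
    # then check each subject's type.
    slow = [
        (s, name)
        for s, _, name in g.triples((None, FOAF.name, None))
        if (s, RDF.type, FOAF.Person) in g
    ]

    # Better, if Person is the rarer pattern: start there and only
    # look up names for the matching subjects.
    fast = [
        (s, name)
        for s in g.subjects(RDF.type, FOAF.Person)
        for name in g.objects(s, FOAF.name)
    ]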

(As Jörn says - debugging/profiling this is often tricky, since everything is an iterator) 

Good luck! :D 

- Gunnar


