Gunnar, others,
I am making steady progress: I am essentially done with what I planned to do. I have, however, a question on the 'style' in RDFLib and how to accommodate this.
First what I did:
- All modules now use relative imports; no more hacks on sys.path. Bye bye Python 2.4 :-)
- I have actually *three* different parsers now. The third one is 'hturtle', meaning extraction of turtle that is embedded in an HTML file as part of a special <script> element, see
http://www.w3.org/TR/turtle/#in-html. I had that buried as part of the RDFa parser but, for RDFLib, I thought it better to separate it out as a parser of its own.
- I have also defined a separate RDFa 1.0 parser (which is just a wrapper around the new RDFa parser setting the version explicitly; the user could also do that, but I thought this is just nicer to have).
- Here is how I have set up the various parsers in plugin.py (note that this means the old rdfa parser can be removed, as you suggested):
# The basic parsers: RDFa (by default, 1.1), microdata, and embedded turtle (a.k.a. hturtle)
register('hturtle', Parser, 'rdflib.plugins.parsers.hturtle', 'HTurtleParser')
register('rdfa', Parser, 'rdflib.plugins.parsers.structureddata', 'RDFaParser')
register('mdata', Parser, 'rdflib.plugins.parsers.structureddata', 'MicrodataParser')
register('microdata', Parser, 'rdflib.plugins.parsers.structureddata', 'MicrodataParser')
# A convenience to use the RDFa 1.0 syntax (although the parse method can be invoked with an rdfa_version keyword, too)
register('rdfa1.0', Parser, 'rdflib.plugins.parsers.structureddata', 'RDFa10Parser')
# Just for completeness, in case the user asks for this name
register('rdfa1.1', Parser, 'rdflib.plugins.parsers.structureddata', 'RDFaParser')
# An HTML file may contain microdata, rdfa, turtle. If the user wants them all, the parser below simply invokes all:
register('html', Parser, 'rdflib.plugins.parsers.structureddata', 'StructuredDataParser')
# Some media types are also bound to RDFa
register('application/svg+xml', Parser, 'rdflib.plugins.parsers.structureddata', 'RDFaParser')
register('application/xhtml+xml', Parser, 'rdflib.plugins.parsers.structureddata', 'RDFaParser')
# 'text/html' media type should be equivalent to html:
register('text/html', Parser, 'rdflib.plugins.parsers.structureddata', 'StructuredDataParser')
(DanBri, what this means is that, for text/html, for example, all structured data will be extracted and smushed together. I hope that is what you would like, right?)
Now for the question.
RDFa 1.1 has a fairly precise notion of what to do with errors. In general, parser errors (and some other errors in the content) are to be collected into a separate graph called the 'processor graph'. Only in very rare cases do these become real ERROR-s, i.e., errors that stop processing. Not only the RDFa 1.1 parser but also the hturtle and the microdata parsers are based on this philosophy; the user can pass a separate graph to the parse call, e.g.:
g.parse(source="something", format="html", pgraph=Graph())
and the error/warning triples are then added to pgraph. In some cases, of course, nothing will be parsed, or only partial parsing will happen due to problems, but the problems will be added to pgraph if any. If no pgraph is given, then, well, they are all lost. This seems to be in a slight contradiction with the rest of the parsers in RDFLib, which simply run into Python exceptions. So here is the question: what is the preferred approach for these parsers? Some options:
1. keep the behaviour as described above
2. keep this behaviour if the user provides a pgraph; if not then, at the end of the processing, raise an exception with the content (ie, the triples) of the virtual pgraph as an exception value
3. merge an internal pgraph and the user's graph at the end of the processing; ie, ignore the user setting and always expand the graph but do not raise exceptions
4. never accept a pgraph, but rather raise an exception with the error triples if there were any
#1 is in line with the RDFa spec and #2 can also be defended to be fine with it, #3 is a bit pragmatic, #4 is more the current RDFLib way.
Note that when raising an exception the value will be a, say, turtle dump of the whole pgraph, a pretty large exception value:-)
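To make option #2 concrete, here is a minimal sketch of what the end of processing could look like. Everything in it is hypothetical: the `ProcessorGraphError` class and the `finish_parse` helper do not exist in the code, and a plain Python set stands in for the pgraph (anything with an `add` method that accepts a triple would do).

```python
class ProcessorGraphError(Exception):
    """Hypothetical exception carrying the collected error/warning triples."""

    def __init__(self, triples):
        super().__init__("%d problem(s) reported during parsing" % len(triples))
        self.triples = list(triples)


def finish_parse(collected, pgraph=None):
    """Option #2: fill the caller's pgraph if given, otherwise raise.

    `collected` is the parser's internal list of problem triples;
    `pgraph` is any container with an `add` method (a set in this sketch).
    """
    if pgraph is not None:
        for triple in collected:
            pgraph.add(triple)
        return
    if collected:
        raise ProcessorGraphError(collected)


# Made-up problem triple, for illustration only.
problems = [("_:e1", "rdf:type", "rdfa:Warning")]

# The caller supplied a pgraph: no exception, the triples land in it.
user_pgraph = set()
finish_parse(problems, pgraph=user_pgraph)

# No pgraph supplied: the same triples travel on the exception instead.
try:
    finish_parse(problems)
    reported = []
except ProcessorGraphError as err:
    reported = err.triples
```

Option #4 would be the same helper without the `pgraph` argument, i.e., always raising when `collected` is non-empty.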
Advice on this? What should be the way to follow? Note that, in fact, I really like the approach of a separate pgraph for all parsers rather than running into Python exceptions, but I guess it is too late to change that...
Apart from that, I still want to run some more tests, although the core of the parsers is unchanged and has been thoroughly tested (the advantage of not having changed the core code!). But we are almost there; the only coding left is to settle this error business.
Cheers
Ivan