Re: Strategy to add pyRdfa3, i.e., RDFa 1.1 parser, to the rdflib core distribution


Gunnar Aastrand Grimnes

Sep 25, 2012, 7:29:25 AM
to Ivan Herman, Niklas Lindström, rdf...@googlecode.com, Dan Brickley, rdfli...@googlegroups.com
Hi all,

I wonder if I have a slightly cleaner solution to all of this.

I do not like the duplication of code in rdflib and the
pyrdfa/pymicrodata projects.

Since both those projects depend on rdflib anyway, why not make them
into rdflib plugin-providing projects. Then just installing them will
mean the new parsers are available to rdflib.

I've moved the code around as needed in my forks here:

https://github.com/gromgull/pymicrodata
and
https://github.com/gromgull/pyrdfa3

The changes to each project are minimal: a bit of setuptools-specific
code in setup.py, adding the dependencies on rdflib and the
entry_points for automatically registering the parsers.
I've put the parsers, i.e. the implementation of RDFLib Parser API in
a rdflib.py file in each project.
The generic StructuredDataParser I've put in the RDFa project.
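
For readers unfamiliar with entry points, the setup.py change described above could look roughly like this. It is a minimal sketch, not the code in the forks: the entry-point group name rdf.plugins.parser is assumed to be the one rdflib scans for parser plugins, and the module path pyRdfa.rdflibparsers and the helper plugin_setup_kwargs are hypothetical names chosen for illustration.

```python
def plugin_setup_kwargs():
    """Keyword arguments that a plugin project's setup.py could hand to setuptools.setup()."""
    return {
        "name": "pyRdfa3",
        # The plugin project depends on rdflib, so installing it pulls rdflib in.
        "install_requires": ["rdflib"],
        "entry_points": {
            # Assumed name of the group that rdflib scans for parser plugins.
            "rdf.plugins.parser": [
                # "<format name> = <module>:<class>"; module and class names are hypothetical.
                "rdfa = pyRdfa.rdflibparsers:RDFaParser",
                "html = pyRdfa.rdflibparsers:StructuredDataParser",
            ],
        },
    }


# In setup.py this would be used roughly as:
#     from setuptools import setup
#     setup(**plugin_setup_kwargs())
if __name__ == "__main__":
    for line in plugin_setup_kwargs()["entry_points"]["rdf.plugins.parser"]:
        print(line)
```

Each entry maps a format name to a class, so installing the package is enough for that name to become usable as a format argument.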

This now works with stock rdflib, for instance with the 3.2.2 I just
released. If either of the projects is installed, the parsers are
automatically available.
The entry_point-registered parsers override the default parsers, so
this will override the current rdfa 1.0 parser.

You can try it now. In a clean environment:

pip install https://github.com/gromgull/pyrdfa3/zipball/master
[...]
pip install https://github.com/gromgull/pymicrodata/zipball/master
[...]

python

>>> import rdflib
>>> g=rdflib.Graph()
>>> g.load('http://ivan-herman.name/', format='html')
>>> print g.serialize(format='n3')
@prefix md: <http://www.w3.org/ns/md#> .
@prefix og: <http://ogp.me/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .

<http://ivan-herman.name/> og:image
"http://1.gravatar.com/blavatar/b36c82dd81cc7fc066d729227bbf8cba?s=300"@en;
og:site_name "Ivan’s private site"@en;
og:title "Ivan’s private site"@en;
og:type "blog"@en;
og:url "http://ivan-herman.name/"@en;
md:item () .

... etc.

This strikes me as a much cleaner and tidier solution, avoiding the
problem of maintaining the code in two places.

Possible counter arguments are:
* rdfa/microdata are not parts of CORE rdflib. Which is a shame, but
we can register them as optional "extras_require" for rdflib, allowing
them to be automatically installed with rdflib like this:

pip install rdflib[RDFa,Microdata]

* there is a tiny bit of extra "noise" in the rdfa/microdata projects
- but since they require rdflib anyway, there is no extra burden on
the user.
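
The optional-extras idea in the first bullet could be declared along these lines. This is only a sketch: the extras names RDFa and Microdata come from the pip command above, while the distribution names pyRdfa3 and pyMicrodata and the helper rdflib_extras are placeholders, not confirmed package names.

```python
def rdflib_extras():
    """Optional dependency groups for rdflib's own setup.py (distribution names are placeholders)."""
    return {
        "RDFa": ["pyRdfa3"],
        "Microdata": ["pyMicrodata"],
    }


# In rdflib's setup.py this would be passed roughly as:
#     setup(..., extras_require=rdflib_extras())
if __name__ == "__main__":
    for extra, requirements in sorted(rdflib_extras().items()):
        print(extra, requirements)
```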

What do you think?

If you approve Ivan, I can make github pull requests and you should be
able to merge the changes with a single click.

Cheers,

- Gunnar




On 31 August 2012 16:38, Ivan Herman <ivan....@gmail.com> wrote:
> Guys,
>
> I have followed Niklas' advice (and found out how to do it:-) and created a separate branch on the repo:
>
> https://github.com/RDFLib/rdflib/tree/structured_data_parsers
>
> This branch includes now:
>
> - RDFa 1.1 parser package
> - microdata->RDF parser package
> - a structureddata.py module to interface these, which also includes a 'joint' parser. Ie, one can say
>
> g = Graph()
> g.parse(URI, format="html")
>
> which will extract any triple in the file in RDFa 1.1, in microdata, or as embedded turtle (in a <script type="text/turtle">...)
>
> I have also spent quite some time to make the code Python 3 compatible and, as far as I could see, it is done. The 2to3 script helped me a lot and there weren't that many big differences. Caveat: I could not really test it under python3, because I have not installed a python3 version of rdflib on my machine and there is no python 3 version of html5lib either.
>
> I really tried to push all this out of the door now; next week vacations are over all over the place and I do not trust I will have the time to work on this much more. I am pleased that I could do that much, though; it is more than I thought I could do.
>
> I plan to write a public blog about the availability of this; maybe some other, good souls will look at it, test it further, etc...
>
> Cheers, and have a good week-end!
>
> Ivan
>
>
> ----
> Ivan Herman
> Bankrashof 108
> 1183NW Amstelveen
> The Netherlands
> http://www.ivan-herman.net
>
>
>



--
http://gromgull.net

Ivan Herman

Sep 25, 2012, 9:15:23 AM
to Gunnar Aastrand Grimnes, Niklas Lindström, rdf...@googlecode.com, Dan Brickley, rdfli...@googlegroups.com
Gunnar,

I am not sure.

I regard the separate pyrdfa and pymicrodata projects as ephemeral. If I had more time, then I would completely turn my deployed services upside down, would use the latest version of rdflib with those in it as the application interface for the services and would then forget about the old.

(That being said, that would mean that users who want to access rdfa/microdata through older versions of rdflib would have a problem.)

The main reason why I would prefer not to go with what you propose is what you actually say: if RDFa/microdata is not part of the core distribution, I believe that would be bad. We really would like these to be an integral part of the core rdf landscape...

So... I would propose to keep the current structure with the knowledge that the separate projects will, eventually, die out (my time permitting). In the meantime, let the synchronization of these be my problem...:-)

Ivan
----
Ivan Herman
4, rue Beauvallon, clos St Joseph
13090 Aix-en-Provence
France
http://www.ivan-herman.net

Ed Summers

Sep 25, 2012, 9:49:34 AM
to rdfli...@googlegroups.com
On Tue, Sep 25, 2012 at 9:15 AM, Ivan Herman <ivan....@gmail.com> wrote:
> The main reason why I would prefer not to go with what you propose is what you actually say: if RDFa/microdata is not part of the core distribution, I believe that would be bad. We really would like these to be an integral part of the core rdf landscape...
>
> So... I would propose to keep the current structure with the knowledge that the separate projects will, eventually, die out (my time permitting). In the meantime, let the synchronization of these be my problem...:-)

I agree that it would be good for rdflib to have parsing support for
RDFa (1.0/1.1) as well as microdata. But I'm confused about the
current state of affairs. Do we currently have an rdflib branch where
there is parsing support for rdfa (1.0/1.1) and microdata? Is there a
plan to merge it into master and make it part of rdflib proper? Do we
have tests for the new code?

Since there is already support (or was the last time I checked) for
rdfa 1.0 in rdflib I think it is a no-brainer to add the 1.1 support
as long as it can still process rdfa 1.0, and can be used in the same
way at the API level.

//Ed

Gunnar Aastrand Grimnes

Sep 25, 2012, 10:06:33 AM
to rdfli...@googlegroups.com
Hi Ed et al,

There is rdfa 1.0 support in rdflib core currently.

Niklas and Ivan made a branch for RDFa 1.1 and Microdata support here:

https://github.com/RDFLib/rdflib/tree/structured_data_parsers

I do of course agree that having this is great!
It's also especially great, since you can be format agnostic, throw
rdflib at webpages and get whatever data exists out, like Ivan wrote
here:

http://ivan-herman.name/2012/08/31/rdfa-microdata-turtle-in-html-and-rdflib/
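
One piece of that format-agnostic extraction, pulling Turtle out of script elements, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in the branch: the class EmbeddedTurtleExtractor and the function extract_embedded_turtle are hypothetical names, and the extracted text would still have to be handed to a Turtle parser.

```python
from html.parser import HTMLParser


class EmbeddedTurtleExtractor(HTMLParser):
    """Collect the text of every <script type="text/turtle"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished Turtle snippets
        self._chunks = None   # text chunks of the script element being read, if any

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "text/turtle":
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.blocks.append("".join(self._chunks).strip())
            self._chunks = None


def extract_embedded_turtle(html_text):
    """Return the Turtle snippets embedded in an HTML document, in document order."""
    extractor = EmbeddedTurtleExtractor()
    extractor.feed(html_text)
    extractor.close()
    return extractor.blocks


PAGE = """<html><body>
<script type="text/turtle">
@prefix ex: <http://example.org/> .
ex:a ex:b ex:c .
</script>
<script>var ignored = 1;</script>
</body></html>"""

if __name__ == "__main__":
    for block in extract_embedded_turtle(PAGE):
        print(block)
```

Scripts of any other type are skipped, so ordinary JavaScript on the page does not end up in the result.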

My concern was that the inclusion was done by copying the source-tree
of the other two projects, giving us two problems:

* the two copies of the code will have to be maintained in parallel
* to avoid changing the imports in the existing path, all interfaces
between the rdfa/microdata code and rdflib are surrounded with
code that messes with sys.path:

https://github.com/RDFLib/rdflib/blob/structured_data_parsers/rdflib/plugins/parsers/structureddata.py#L35

which seems dirty and brittle.

My solution was to keep rdfa and microdata as separate projects,
easily installed and with their parsers then automatically plugged in.

But if the general consensus is that this is fine, I am happy to stop
being fascist about it and merge it now!

Cheers,
- Gunnar
> --
> You received this message because you are subscribed to the Google Groups "rdflib-dev" group.
> To post to this group, send email to rdfli...@googlegroups.com.
> To unsubscribe from this group, send email to rdflib-dev+...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>



--
http://gromgull.net

Dan Brickley

Sep 25, 2012, 10:30:13 AM
to Ivan Herman, Gunnar Aastrand Grimnes, Niklas Lindström, rdf...@googlecode.com, rdfli...@googlegroups.com
I don't know the implementation details and tradeoffs, but here's a wishlist.

1. Anyone working in Python who goes looking for a solid library to
handle Microdata (for schema.org, most typically) "gets for free" an
installation that will equally well parse out RDFa Lite markup into a
similar Python representation. And if they dig around, they'll perhaps
find other things useful (e.g. SPARQL 1.1, eventually...).

2. Conceptual (forcing a triples-centric view) or software engineering
(e.g. dependency chains pulling in things that might break) aspects of
the RDF dependency shouldn't make Microdata enthusiasts curse this
entanglement and wish for 'something simpler'.

3. Anyone working in RDFa in Python gets a modern
up-to-date-with-the-specs 1.1 environment, and similarly 'for free' in
terms of hassle and distraction, a working modern Microdata parser
that deals with XHTML, HTML and tag soup without complaint.


General goal being to reduce the cost of choosing one of these options
(microdata, rdfa lite) over the other, and to gently introduce
Microdata users to the finer achievements of the RDF world, rather
than our communal tendency to inflict esoteric details first.

A guy can dream :)

I think we're close...

Dan

Gunnar Aastrand Grimnes

Sep 27, 2012, 4:52:03 AM
to Dan Brickley, Ivan Herman, Niklas Lindström, rdf...@googlecode.com, rdfli...@googlegroups.com
Right - I don't really have a big problem integrating it the way it is
in the branch, I just wanted to show the other way.

However, thinking about it, the direct integration probably is better :

* I probably underestimate the number of people who will be
confused/give up when installing extra packages is required, to me it
looks so easy :)

* We already got burned by "developer cleanliness" vs. "simple for the
user" when we split off sparql into rdfextras (although, it did lead
to a clean test-suite that actually passes, which has helped a lot) -
this time I'd rather err in the other direction :)

HOWEVER,

the current sys.path manipulation doesn't work if rdflib is installed
as unextracted egg-file (quite when this happens is not clear to me,
in windows it does, on linux)

Was it possible to solve it also with relative imports Ivan? Then we
can ditch 2.4 support for 3.3 release and just use those?

I'll announce this in another email to rdflib-dev and see if anyone
REALLY REALLY needs 2.4 support to go on.

How is py3 support looking? In my test, html5lib doesn't install
cleanly with pip - and with the branch from the python3 folder in the
html5lib repo installed manually, it dies deep inside html5lib with:

  File "/home/ggrimnes/projects/rdflib/rdflib/testenv3/lib/python3.2/site-packages/html5lib-0.95-py3.2.egg/html5lib/inputstream.py", line 442, in detectEncoding
    encoding = self.detectBOM()
  File "/home/ggrimnes/projects/rdflib/rdflib/testenv3/lib/python3.2/site-packages/html5lib-0.95-py3.2.egg/html5lib/inputstream.py", line 523, in detectBOM
    self.rawStream.seek(encoding and seek or 0)
io.UnsupportedOperation: seek

Cheers,

- Gunnar
--
http://gromgull.net

Ivan Herman

Sep 27, 2012, 11:28:23 AM
to rdfli...@googlegroups.com, Dan Brickley, Niklas Lindström, rdf...@googlecode.com

On Sep 27, 2012, at 04:52 , Gunnar Aastrand Grimnes wrote:

> Right - I don't really have a big problem integrating it the way it is
> in the branch, I just wanted to show the other way.
>
> However, thinking about it, the direct integration probably is better :
>
> * I probably underestimate the number of people who will be
> confused/give up when installing extra packages is required, to me it
> looks so easy :)
>
> * We already got burned by "developer cleanliness" vs. "simple for the
> user" when we split off sparql into rdfextras (although, it did lead
> to a clean test-suite that actually passes, which has helped a lot) -
> this time I'd rather err in the other direction :)
>
> HOWEVER,
>
> the current sys.path manipulation doesn't work if rdflib is installed
> as unextracted egg-file (quite when this happens is not clear to me,
> in windows it does, on linux)
>
> Was it possible to solve it also with relative imports Ivan? Then we
> can ditch 2.4 support for 3.3 release and just use those?

Just to clarify: AFAIK, relative imports are not available in Python 2.4, so are you asking whether I would be ready to change the code to use relative import, knowing that this would make it incompatible with 2.4? My answer is yes, I am ready to do this (although I am not sure when, my days are a bit hectic these days) but it is up to you whether this is acceptable for RDFLib as a whole.

My own deployed application on W3C would be fine, because we used 2.5 or 2.6.
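
To make the trade-off concrete, here is a small self-contained sketch of what explicit relative imports buy a bundled package. The package names host_pkg_demo and vendored are invented for the demonstration and created in a temporary directory; nothing here is taken from the rdflib branch. The bundled module finds its sibling through "from . import helper", so only the directory holding the top-level package has to be on sys.path.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def write(path, text):
    """Create parent directories as needed and write a small source file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)


# Throwaway layout: host_pkg_demo plays the role of rdflib,
# host_pkg_demo/vendored the role of a bundled parser package.
root = Path(tempfile.mkdtemp())
pkg = root / "host_pkg_demo" / "vendored"
write(root / "host_pkg_demo" / "__init__.py", "")
write(pkg / "__init__.py", "")
write(pkg / "helper.py", "ANSWER = 42\n")
# Explicit relative import (PEP 328, Python 2.5 and later): resolved against
# the enclosing package, so the bundled code need not be on sys.path itself.
write(pkg / "core.py", "from . import helper\n\n\ndef answer():\n    return helper.ANSWER\n")

sys.path.insert(0, str(root))   # only the directory holding the top-level package
importlib.invalidate_caches()
core = importlib.import_module("host_pkg_demo.vendored.core")

if __name__ == "__main__":
    print(core.answer())
```

With absolute imports inside the bundled code, the vendored directory itself would have to be added to sys.path around every call, which is the manipulation the thread wants to get rid of.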

>
> I'll announce this in another email to rdflib-dev and see if anyone
> REALLY REALLY needs 2.4 support to go on.
>

O.k. Let us see what happens...

> How is py3 support looking? In my test, html5lib doesn't install
> cleanly with pip - and with the branch from the python3 folder in the
> html5lib repo installed manually, it dies deep inside html5lib with:
>
>   File "/home/ggrimnes/projects/rdflib/rdflib/testenv3/lib/python3.2/site-packages/html5lib-0.95-py3.2.egg/html5lib/inputstream.py", line 442, in detectEncoding
>     encoding = self.detectBOM()
>   File "/home/ggrimnes/projects/rdflib/rdflib/testenv3/lib/python3.2/site-packages/html5lib-0.95-py3.2.egg/html5lib/inputstream.py", line 523, in detectBOM
>     self.rawStream.seek(encoding and seek or 0)
> io.UnsupportedOperation: seek
>

Yeah, the problem is with html5lib, and I cannot really answer that. I have gone through my own code with 2to3 and subsequent manual edits, and I hope to have made all the necessary changes to make the code compatible with Python3. However, I could not test it with Python3, because I do not have a proper local installation and also due to the html5lib issue. I would not expect major issues in my code, though it still needs basic testing.

Cheers

Ivan


Ivan Herman

Oct 4, 2012, 2:36:03 PM
to Gunnar Aastrand Grimnes, Dan Brickley, Niklas Lindström, rdf...@googlecode.com, rdfli...@googlegroups.com
Gunnar,

I would like to know how this will be decided and when. I have changed the local versions of the pyrdfa3 and microdata packages to use relative imports, and I am happy to change the version that is included in the RDFLib package; I do not think it would take a lot of time. But I would like to do that when I am sure that we have dropped 2.4...

Cheers

Ivan

Gunnar Aastrand Grimnes

Oct 5, 2012, 3:46:38 AM
to Ivan Herman, Dan Brickley, Niklas Lindström, rdf...@googlecode.com, rdfli...@googlegroups.com
RDFLib doesn't really have a decision process - but while we wait for
a council of rdflib elders to form, I can put on the "benevolent
dictator" hat (available to anyone who wants it!) and decide:

* the next version of rdflib will no longer support python 2.4

If anyone has a problem with this, they should have spoken up in this
thread sooner :)

So please go ahead and make the changes in the structured parsers
branch! Then we can merge it soon.

Also, since I am already making bold decisions, how about deleting the
old rdfa 1.0 parser? I cannot see this being maintained now, and you
said the new one still works for most rdfa 1.0 content?

Cheers,

- Gunnar
--
http://gromgull.net

Dan Brickley

Oct 5, 2012, 5:32:09 AM
to Gunnar Aastrand Grimnes, Ivan Herman, Niklas Lindström, rdf...@googlecode.com, rdfli...@googlegroups.com
On 5 October 2012 08:46, Gunnar Aastrand Grimnes <grom...@gmail.com> wrote:
> RDFLib doesn't really have a decision process - but while we wait for
> a council of rdflib elders to form, I can put on the "benevolent
> dictator" hat (available to anyone who wants it!) and decide:
>
> * the next version of rdflib will no longer support python 2.4
>
> If anyone has a problem with this, they should have spoken up in this
> thread sooner :)

I support that, having just double-checked Google App Engine is OK
with it - yup https://developers.google.com/appengine/docs/python/python25/
where they write "App Engine currently supports two versions of
Python: 2.5 and 2.7. We recommend the use of Python 2.7, which
incorporates many new features, including multithreading, concurrent
requests, and a variety of updated libraries and features.".

> Also, since I am already making bold decisions, how about deleting the
> old rdfa 1.0 parser? I cannot see this being maintained now, and you
> said the new one still works for most rdfa 1.0 content?

The sooner we get the world onto 1.1 and mostly Lite, the better. Are
there any major 1.0 deployments anyone's aware of, who are unable to
evolve and adopt 1.1?

Dan

Ivan Herman

Oct 5, 2012, 7:00:35 PM
to Gunnar Aastrand Grimnes, Dan Brickley, Niklas Lindström, rdf...@googlecode.com, rdfli...@googlegroups.com

On Oct 5, 2012, at 03:46 , Gunnar Aastrand Grimnes wrote:

> RDFLib doesn't really have a decision process - but while we wait for
> a council of rdflib elders to form, I can put on the "benevolent
> dictator" hat (available to anyone who wants it!)

:-)

> and decide:
>
> * the next version of rdflib will no longer support python 2.4
>
> If anyone has a problem with this, they should have spoken up in this
> thread sooner :)


:-) I am fine with it.

>
> So please go ahead and make the changes in the structured parsers
> branch! Then we can merge it soon.
>

O.k. I have started to do that, of course ran into some bug somewhere about an hour ago:-), so I will have to find that one first.

Unfortunately, my time in the US is much more limited than when I am at home, but I will do my best to do that within a foreseeable time...


> Also, since I am already making bold decisions, how about deleting the
> old rdfa 1.0 parser? I cannot see this being maintained now, and you
> said the new one still works for most rdfa 1.0 content?

Yes. Let me offer the following: I would add yet another plugin entry for 'rdfa1.0' that would explicitly start the parser in 1.0 mode. The default for 'rdfa' would then be 1.1. Ie, if somebody has a very very clearly 1.0 content, it can still be parsed. If that is done, I am fine removing the old one.
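
The 'rdfa1.0' entry proposed here amounts to a thin wrapper that pins the version. A minimal sketch of that shape follows; the RDFaParser below is a stand-in that only records its arguments, and the "1.0"/"1.1" string values and the default are assumptions rather than the real parser's interface.

```python
class RDFaParser:
    """Stand-in for the real RDFa parser: it only records what it was asked to do."""

    def parse(self, source, sink, rdfa_version="1.1", **kwargs):
        sink.append((source, rdfa_version))


class RDFa10Parser(RDFaParser):
    """Same parser, but pinned to RDFa 1.0 regardless of what the caller passes."""

    def parse(self, source, sink, **kwargs):
        kwargs["rdfa_version"] = "1.0"
        super().parse(source, sink, **kwargs)


if __name__ == "__main__":
    calls = []
    RDFaParser().parse("page.html", calls)     # default: 1.1
    RDFa10Parser().parse("page.html", calls)   # explicit 1.0 mode
    print(calls)
```

Registering the wrapper under its own format name keeps 1.0 content parseable while the plain 'rdfa' name moves to 1.1.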

Ivan Herman

Oct 8, 2012, 6:03:22 PM
to Gunnar Aastrand Grimnes, Dan Brickley, Niklas Lindström, rdf...@googlecode.com, rdfli...@googlegroups.com
Gunnar, others,

I am making steady progress: I am essentially done with what I planned to do. I have, however, a question on the 'style' in RDFLib and how to accommodate this.

First what I did:

- I now use relative imports for all modules, no more hack on sys.path. Bye bye Python 2.4 :-)
- I have actually *three* different parsers now. The third one is 'hturtle', meaning extraction of turtle that is embedded in an HTML file as part of a special <script> element, see http://www.w3.org/TR/turtle/#in-html. I had that buried as part of the RDFa parser but, for RDFLib, I thought it is better to separate it as a specific parser.
- I have also defined a separate RDFa 1.0 parser (which is just a wrapper around the new RDFa parser setting the version explicitly; the user could also do that, but I thought this is just nicer to have).
- Here is how I have set up the various parsers in plugin.py (note that this means the old rdfa parser can be removed, as you suggested):

# The basic parsers: RDFa (by default, 1.1), microdata, and embedded turtle (a.k.a. hturtle)
register('hturtle', Parser, 'rdflib.plugins.parsers.hturtle', 'HTurtleParser')
register('rdfa', Parser, 'rdflib.plugins.parsers.structureddata', 'RDFaParser')
register('mdata', Parser, 'rdflib.plugins.parsers.structureddata', 'MicrodataParser')
register('microdata', Parser, 'rdflib.plugins.parsers.structureddata', 'MicrodataParser')
# A convenience to use the RDFa 1.0 syntax (although the parse method can be invoked with an rdfa_version keyword, too)
register('rdfa1.0', Parser, 'rdflib.plugins.parsers.structureddata', 'RDFa10Parser')
# Just for the completeness, if the user uses this
register('rdfa1.1', Parser, 'rdflib.plugins.parsers.structureddata', 'RDFaParser')
# An HTML file may contain microdata, rdfa, turtle. If the user wants them all, the parser below simply invokes all:
register('html', Parser, 'rdflib.plugins.parsers.structureddata', 'StructuredDataParser')
# Some media types are also bound to RDFa
register('application/svg+xml', Parser, 'rdflib.plugins.parsers.structureddata', 'RDFaParser')
register('application/xhtml+xml', Parser, 'rdflib.plugins.parsers.structureddata', 'RDFaParser')
# 'text/html' media type should be equivalent to html:
register('text/html', Parser, 'rdflib.plugins.parsers.structureddata', 'StructuredDataParser')

(DanBri, what this means is that, for text/html, for example, all structured data will be extracted and smushed together. I hope that is what you would like, right?)
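
For readers who have not looked at plugin.py, the register calls above follow a name-to-class lookup pattern that can be sketched in a few lines. This is not rdflib's implementation: the registry below is a plain dictionary, the Parser class is only a marker, and standard-library classes stand in for the real parser classes so that the sketch runs on its own.

```python
import importlib

_plugins = {}


class Parser:
    """Marker for the plugin kind, standing in for rdflib's Parser base class."""


def register(name, kind, module_path, class_name):
    """Record where a plugin lives; a later call for the same (name, kind) replaces it."""
    _plugins[(name, kind)] = (module_path, class_name)


def get(name, kind):
    """Import the plugin's module on request and return the class."""
    module_path, class_name = _plugins[(name, kind)]
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Stand-in targets from the standard library; two names share one class,
# as 'html' and 'text/html' do in the listing above.
register("html", Parser, "html.parser", "HTMLParser")
register("text/html", Parser, "html.parser", "HTMLParser")

if __name__ == "__main__":
    print(get("html", Parser) is get("text/html", Parser))
```

Because only strings are stored at registration time, registering a parser costs nothing until somebody actually asks for that format.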

Now for the question.

RDFa 1.1 has a fairly precise notion of what to do with errors. In general, parser errors (and some other errors in the content) are collected into a separate graph called the 'processor graph'. Only in very rare cases do these become real ERRORs, i.e., errors that stop processing. The whole RDFa 1.1 parser and, in fact, the hturtle and microdata parsers too, are based on this philosophy; the user can pass a separate graph to the parser, e.g.:

g.parse(source="something", format="html", pgraph=Graph())

and the error/warning triples are then added to pgraph. In some cases, of course, nothing will be parsed, or only partial parsing will happen due to problems, but the problems will be added to pgraph if any. If no pgraph is given, then, well, they are all lost. This seems to be in a slight contradiction with the rest of the parsers in RDFLib, which simply run into Python exceptions. So here is the question: what is the preferred approach for these parsers? Some options:

1. keep the behaviour as described above
2. keep this behaviour if the user provides a pgraph; if not then, at the end of the processing, raise an exception with the content (ie, the triples) of the virtual pgraph as an exception value
3. merge an internal pgraph and the user's graph at the end of the processing; ie, ignore the user setting and always expand the graph but do not raise exceptions
4. never accept a pgraph, but rather raise an exception with the error triples if there were any

#1 is in line with the RDFa spec and #2 can also be defended to be fine with it, #3 is a bit pragmatic, #4 is more the current RDFLib way.

Note that when raising an exception the value will be a, say, turtle dump of the whole pgraph, a pretty large exception value:-)

Advice on this? What should be the way to follow? Note that, in fact, I really like the approach of a separate pgraph for all parsers rather than running into Python exceptions, but I guess it is too late to change that...

Apart from that, I would still want to run some more tests, although the core of the parsers is unchanged and has been thoroughly tested (the advantage of not having changed the core code!). But we are almost there; the only coding I still have to do is to settle this error business.

Cheers

Ivan

Gunnar Aastrand Grimnes

Oct 9, 2012, 2:22:50 AM
to rdfli...@googlegroups.com, Dan Brickley, Niklas Lindström, rdf...@googlecode.com
> 1. keep the behaviour as described above
> 2. keep this behaviour if the user provides a pgraph; if not then, at the end of the processing, raise an exception with the content (ie, the triples) of the virtual pgraph as an exception value
> 3. merge an internal pgraph and the user's graph at the end of the processing; ie, ignore the user setting and always expand the graph but do not raise exceptions
> 4. never accept a pgraph, but rather raise an exception with the error triples if there were any
>
> #1 is in line with the RDFa spec and #2 can also be defended to be fine with it, #3 is a bit pragmatic, #4 is more the current RDFLib way.
>
> Note that when raising an exception the value will be a, say, turtle dump of the whole pgraph, a pretty large exception value:-)
>
> Advice on this? What should be the way to follow? Note that, in fact, I really like the approach of a separate pgraph for all parsers rather than running into Python exceptions, but I guess it is too late to change that...


The abort-on-error behaviour of rdflib has annoyed me many times.
Mainly when trying to parse files from dbpedia :)

Maybe it is time to allow a different behaviour - clearly we would
have to keep the current behaviour as default, but we could introduce
a flag

graph.load( source, format, errors= blah )

where blah by default is 'raise' (like today), or 'ignore' (silently),
'warn' (with warnings module) or a graph object, in which case they
are returned?
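
A rough sketch of how such a flag could dispatch, using a toy line-based parser rather than any rdflib code. The names report and parse_lines are hypothetical, and a plain list stands in for the graph object that would collect the problems.

```python
import warnings


def report(problem, errors="raise"):
    """Handle one parse problem according to the requested policy."""
    if errors == "raise":
        raise problem
    if errors == "ignore":
        return
    if errors == "warn":
        warnings.warn(str(problem), UserWarning, stacklevel=2)
        return
    errors.append(problem)   # anything else is treated as a collector


def parse_lines(lines, errors="raise"):
    """Toy parser: each line must hold exactly three whitespace-separated terms."""
    triples = []
    for number, line in enumerate(lines, start=1):
        terms = line.split()
        if len(terms) != 3:
            message = "line %d: expected 3 terms, got %d" % (number, len(terms))
            report(ValueError(message), errors)
            continue
        triples.append(tuple(terms))
    return triples


if __name__ == "__main__":
    data = ["a b c", "broken", "d e f"]
    problems = []
    print(parse_lines(data, errors=problems))
    print(problems)
```

Keeping 'raise' as the default preserves today's behaviour, while the other three modes let a caller salvage the triples that did parse.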

I could make this change for the other core parsers ...

Does this sound like a good idea to everyone?

A semi-related point - most of the parsers have a streaming interface
internally, i.e. there is a sink object that saves each triple to the
graph. I've also been wanting to expose this stream parsing... maybe
I'll get it done at the same time
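
The sink idea can be illustrated with a toy parser that hands each triple to a callback as soon as it is read. The input format, the function parse_stream and the DATA sample are invented for this sketch and do not reflect how the rdflib parsers are written.

```python
def parse_stream(lines, sink):
    """Toy line-based parser: hand each triple to sink as soon as it is read."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("."):
            line = line[:-1].rstrip()
        subject, predicate, obj = line.split(None, 2)
        sink((subject, predicate, obj))


DATA = [
    "# toy input, one triple per line",
    "ex:a ex:knows ex:b .",
    "ex:b ex:knows ex:c .",
]

if __name__ == "__main__":
    graph = set()
    parse_stream(DATA, graph.add)   # default behaviour: store everything
    parse_stream(DATA, print)       # streaming consumer: nothing is stored
    print(len(graph))
```

Exposing the sink means a caller can filter or count triples from a large file without building the whole graph in memory.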

- Gunnar
http://gromgull.net

Ivan Herman

Oct 9, 2012, 5:50:12 PM
to rdfli...@googlegroups.com, Dan Brickley, Niklas Lindström, rdf...@googlecode.com
So, I gave some second thoughts. Gunnar, I actually agree with you that this is what should happen, but I also believe that this is something that should be handled on a higher level, not at the level of individual parsers. So here is what I propose to do for RDFa:

- For real errors (eg, XML parsing errors, failed source when trying to get a file, etc) I would raise an exception. Actually, some of those may be caught by the layer above the parser anyway.
- The rest are really warnings. E.g., using property="abc:def", where 'abc' is not defined as a prefix: the system then treats it as a URI scheme but, to be on the safe side, issues a warning unless 'abc' is a well-known, registered URI scheme. These should really not raise exceptions. So, for those cases, the warnings are collected in a separate, user-given graph or, if none is given, ignored.

The hturtle and the microdata cases would do something similar.
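
The warning case from the second bullet can be sketched as follows. This is an illustration only: resolve_curie is a hypothetical helper, KNOWN_SCHEMES is a tiny illustrative subset of registered schemes, and a set of tuples stands in for the processor graph (both offer an add method).

```python
KNOWN_SCHEMES = {"http", "https", "ftp", "mailto", "urn"}   # illustrative subset only


def resolve_curie(value, prefixes, pgraph=None):
    """Expand prefix:local if the prefix is declared; otherwise keep the value as a URI."""
    prefix, colon, local = value.partition(":")
    if not colon:
        return value   # plain terms are outside the scope of this sketch
    if prefix in prefixes:
        return prefixes[prefix] + local
    if prefix not in KNOWN_SCHEMES and pgraph is not None:
        # A warning, not an error: record it and carry on.
        note = "undeclared prefix %r treated as a URI scheme" % prefix
        pgraph.add((value, "warning", note))
    return value


if __name__ == "__main__":
    prefixes = {"dc": "http://purl.org/dc/terms/"}
    pgraph = set()
    print(resolve_curie("dc:title", prefixes, pgraph))
    print(resolve_curie("abc:def", prefixes, pgraph))
    print(sorted(pgraph))
```

Without a pgraph the same call simply returns the value and the warning is dropped, which matches the "ignored if none is given" behaviour described above.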

Is that o.k.?

Ivan

Ivan Herman

Oct 10, 2012, 4:23:51 PM
to Gunnar Aastrand Grimnes, Dan Brickley, Niklas Lindström, rdf...@googlecode.com, rdfli...@googlegroups.com
Gunnar,

I have pushed all my changes to the repo, into the structured_data_parsers branch. It now includes RDFa 1.1, microdata, and hturtle, with the error management as described in my previous message.

I am sure there are bugs (of course:-) but I thought pushing this to the repo is a good step forward; I will still test it in the days to come (time permitting).

I would let you guys take it from here in terms of incorporating it into the main branch at some point, testing your installation procedures (which I never use locally, I must admit, I just store the libraries in PYTHONPATH...), etc.

Cheers

Ivan

Gunnar Aastrand Grimnes

Oct 23, 2012, 9:05:52 AM
to rdfli...@googlegroups.com, Dan Brickley, Niklas Lindström, rdf...@googlecode.com
Looks good Ivan,

From my side, there are two remaining obstacles to merging this into master,

1. I would like to have the rdfa testing sorted out - currently there
are only the old (I am assuming rdfa 1.0?) tests in test/.
I guess the rdfa WG has made a new test-suite for rdfa 1.1 (lite)?

Can we have these included in RDFLib ? I assume the old test-harness
code can be reused.

2. the python 3 status of html5lib - I'd really rather not revert to
an rdflib that does not work on python3.

The comments on this issue

http://code.google.com/p/html5lib/issues/detail?id=187

seem to say that as of October 9 everything was working OK.

Does anyone feel like testing/investigating?

Cheers,

- Gunnar

Ivan Herman

Oct 23, 2012, 9:59:18 AM
to rdfli...@googlegroups.com, Dan Brickley, Niklas Lindström, rdf...@googlecode.com

On Oct 23, 2012, at 09:05 , Gunnar Aastrand Grimnes wrote:

> Looks good Ivan,
>
> From my side, there are two remaining obstacles to merging this into master,
>
> 1. I would like to have the rdfa testing sorted out - currently the
> old (I am assuming rdfa 1.0?) tests in test/
> I guess the rdfa WG has made a new test-suite for rdfa 1.1 (lite) ?
>

Oh yes, there is one on http://rdfa.info/test-suite/; actually, they are on github as part of https://github.com/rdfa/rdfa-website.

But... I have tested my stuff using the online service and not by running them locally. Niklas may be of more help here because, I believe, he extracted the tests for his own implementation.


> Can we have these included in RDFLib ? I assume the old test-harness
> code can be reused.
>

I am not 100% sure. Gregg Kellogg pretty much rewrote the test harness I believe.


> 2. the python 3 status of html5lib - I'd really rather not revert to
> an rdflib that does not work on python3.
>
> The comments on this issue
>
> http://code.google.com/p/html5lib/issues/detail?id=187
>
> seems to say that as of october 9. everything was working ok.
>
> Does anyone feel like testing/investigating?
>


Yeah. I am not sure what to do with this. This is way beyond my sphere of influence.

Ivan

P.S. I must admit that I was always frustrated by python 3. I understand the advantages of python 3 with, say, unicode, but the price the community is paying is huge. I sincerely believe that this was an erroneous decision by the python community but, well, I am just a lambda user...