rdfa support

Ed Summers

Jun 11, 2013, 10:06:03 AM
to rdfli...@googlegroups.com
I just happened to notice that an RDFa page from the BBC

http://www.bbc.co.uk/news/world-us-canada-22857062

which used to parse fine with rdflib v3.3 [1] now fails to parse with
v4.1 [2]. Have regressions in RDFa parsing been ticketed already?
Perhaps I'm doing something wrong here?

//Ed

[1] https://gist.github.com/edsu/5757032
[2] https://gist.github.com/edsu/5757093

ch...@improbable.org

Jun 11, 2013, 10:18:47 AM
to rdfli...@googlegroups.com
I'm actually having trouble parsing data from URLs even with 3.4:

https://gist.github.com/acdha/4ba5286f9f872f771ea6

All of these tests were in a clean virtualenv using Python 2.7 on OS X 10.8.

Chris

Ed Summers

Jun 11, 2013, 10:20:32 AM
to rdfli...@googlegroups.com
Hmm, seems like it might be related to the version of html5lib you are using? I just backpedaled to html5lib v0.95 and the page works fine with rdflib v4.1.

    pip install html5lib==0.95

//Ed

Chris Adams

Jun 11, 2013, 10:25:53 AM
to rdfli...@googlegroups.com
This worked for me as well. I'm assuming this means there are some unicode-safety changes in html5lib.

Ed Summers

Jun 11, 2013, 11:08:17 AM
to rdfli...@googlegroups.com
I guess this isn't caught by the rdflib test suite since it seems to skip all the rdfa tests ... :-(

//Ed

Ivan Herman

Jun 12, 2013, 10:16:57 AM
to rdfli...@googlegroups.com


Chris Adams wrote:
> This worked for me as well. I'm assuming this means there's some unicode-safety
> changes in html5lib.

Chris, can you tell me a bit more about the issue with html5lib, i.e., what you
believe it does? I pretty much used that lib as a black box for the RDFa
processing and it is really bad news if it becomes a problem...

(Just back from a long far-Eastern trip, could not check this...)

Thanks

Ivan

>
> --
> http://github.com/RDFLib
> ---
> You received this message because you are subscribed to the Google Groups
> "rdflib-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email
> to rdflib-dev+...@googlegroups.com.
> To post to this group, send email to rdfli...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/rdflib-dev/dc39076d-0821-47f8-a813-614f7d512db1%40googlegroups.com?hl=en-US.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Gunnar Aastrand Grimnes

Jun 15, 2013, 3:21:27 AM
to rdfli...@googlegroups.com
Sorry for the late reply to this (and issues), I was on holiday
without internet for a week.

We upgraded to the new html5lib, since it has py3 support. The
changelog for html5lib doesn't say anything about unicode changes, but
py3 porting may well have introduced some.

On 11 June 2013 17:08, Ed Summers <e...@pobox.com> wrote:
> I guess this isn't caught by the rdflib test suite since it seems to skip
> all the rdfa tests ... :-(

https://github.com/RDFLib/rdflib/issues/304

- Gunnar


--
http://gromgull.net

Ivan Herman

Jun 17, 2013, 7:33:31 AM
to Gunnar Aastrand Grimnes, rdfli...@googlegroups.com, e...@pobox.com
(trying to bind all the different threads:-)

Well... this seems to be an html5lib error...

I created a small python program, trying to do what the RDFa parser does:

[[[
import sys
#sys.path.insert(0,"/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1")

import html5lib
from urllib2 import Request, urlopen

req = Request(url='http://www.bbc.co.uk/news/world-us-canada-22857062')
data = urlopen(req)

parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))
dom = parser.parse(input)
print dom
]]]

If this is run with an older version of html5lib, then things are fine and a
DOM tree is created. If it is run with the latest version of html5lib (on my
local machine: by uncommenting the sys.path line) then I get an exception:

[[[
Traceback (most recent call last):
  File "htmlbug.py", line 12, in <module>
    dom = parser.parse(input)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/html5parser.py", line 223, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/html5parser.py", line 87, in _parse
    parser=self, **kwargs)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/tokenizer.py", line 40, in __init__
    self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/inputstream.py", line 132, in HTMLInputStream
    return HTMLBinaryInputStream(source, encoding, parseMeta, chardet)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/inputstream.py", line 394, in __init__
    self.rawStream = self.openStream(source)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/inputstream.py", line 431, in openStream
    stream = BytesIO(source)
TypeError: 'builtin_function_or_method' does not have the buffer interface
]]]

I am not sure what that exception means. I presume one should report that back
to the html5lib developers, but if anybody could run the same to be sure that
there is indeed a bug...

Note that if the bbc file is copied to a local file then parsing works properly.
It seems to have something to do with the HTTP return headers, but I do not know
why.

:-(

Does anybody have a good idea here?

Ivan


Gunnar Aastrand Grimnes wrote:
> Here in the original email Ed had an example uri that failed.
> (I've not tried it)

Ed Summers

Jun 17, 2013, 8:47:37 AM
to rdfli...@googlegroups.com
I think you want to try to change:

dom = parser.parse(input)

to:

dom = parser.parse(data)

It still doesn't work, but the exception is a bit more useful.
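
For anyone puzzled by the first traceback: `input` in that script is Python's built-in function rather than the response object, so the parser ended up trying to wrap a function in a byte buffer. Below is a minimal sketch of that failure in isolation, written for Python 3 (where the wording of the error differs from the Python 2 message quoted earlier). `wrap_in_bytesio` is only an illustrative name, not something from html5lib.

```python
import io


def wrap_in_bytesio(source):
    """Try to wrap `source` in a BytesIO, as a byte-oriented stream opener might.

    Returns the BytesIO on success, or the TypeError raised for objects
    that are not bytes-like.
    """
    try:
        return io.BytesIO(source)
    except TypeError as exc:
        return exc


# Passing real bytes works.
ok = wrap_in_bytesio(b"<html></html>")

# Passing the built-in `input` function (the typo in the script) does not.
failed = wrap_in_bytesio(input)
```

With `data` passed instead, this first hurdle disappears and the parser gets far enough to hit the real problem.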

//Ed

Ivan Herman

Jun 17, 2013, 8:52:58 AM
to rdfli...@googlegroups.com
Oops, sorry, I made a mistake when finalizing my message!

Indeed. It should be data, and the exception says:

[[[

14:50 tmp> python htmlbug.py
Traceback (most recent call last):
  File "htmlbug.py", line 12, in <module>
    dom = parser.parse(data)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/html5parser.py", line 223, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/html5parser.py", line 87, in _parse
    parser=self, **kwargs)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/tokenizer.py", line 40, in __init__
    self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/inputstream.py", line 132, in HTMLInputStream
    return HTMLBinaryInputStream(source, encoding, parseMeta, chardet)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/inputstream.py", line 411, in __init__
    self.charEncoding = self.detectEncoding(parseMeta, chardet)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/inputstream.py", line 448, in detectEncoding
    encoding = self.detectEncodingMeta()
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/inputstream.py", line 535, in detectEncodingMeta
    assert isinstance(buffer, bytes)
AssertionError
]]]

which shows that it is related to encoding issues through some mysterious ways...

Ivan

Ed Summers

Jun 17, 2013, 9:11:19 AM
to rdfli...@googlegroups.com
This example, pulled from the html5lib doc page [1], seems to work ok
for me with html5lib v1.0b1

https://gist.github.com/edsu/5796610

//Ed

[1] https://github.com/html5lib/html5lib-python

Ivan Herman

Jun 17, 2013, 9:45:31 AM
to rdfli...@googlegroups.com
Which worries me. What this seems to indicate is that html5lib needs its own
URL open mechanism, and that it does not (always) work with the method provided
by the Python libraries. On the other hand, I rely on the Python libraries to
get information on the HTTP return header entries...

One way would be to get the header data, *close* the URI and then let the
HTML5Lib do its own thing again. But that is crazy, doing a double access for
the same content:-(

Ivan

Ed Summers

Jun 17, 2013, 10:45:28 AM
to rdfli...@googlegroups.com
I'm sure there's a way to get it to work. Personally I would like to
see rdflib's RDFa test suite failing right now...

//Ed

Ivan Herman

Jun 17, 2013, 1:23:18 PM
to rdfli...@googlegroups.com, rdfli...@googlegroups.com
On 17 Jun 2013, at 16:45, Ed Summers <e...@pobox.com> wrote:

> I'm sure there's a way to get it to work.

I hope. But it may be much better to send this as a bug report to the lib maintainers... Unless somebody on this list has a brilliant idea, I am ok doing it myself.

(Well, I do have one avenue I may try if I have the time. The issue seems to be around the character set. If the same code works when the charset is set manually to, say, utf-8, then one could do the sniffing directly instead of leaving that to the library. This is just an idea I got right now, to be tested at some point, hopefully this week. Although this would just mean working around a bug, which is not a healthy thing to do...)
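
To make the "do the sniffing directly" idea a little more concrete, here is a rough sketch that pulls the charset out of an HTTP Content-Type header value using only the standard library. The function name and the utf-8 fallback are assumptions for illustration; nothing here is taken from the RDFa parser itself.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` (an assumption, not something HTTP guarantees)
    when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

The resulting name could then be handed to the parser in place of its own detection; which keyword argument carries it depends on the html5lib version, so that part is left out here.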

> Personally I would like to
> see rdflib's RDFa test suite failing right now...

I am not sure I understand what you mean:-(

Ivan

Niklas Lindström

Jun 17, 2013, 4:49:07 PM
to rdfli...@googlegroups.com
Hi all,

There is a bug in html5lib 1.0b1. When html5lib.inputstream.HTMLBinaryInputStream is used with a file-like object (such as the result of urllib2.urlopen), it wraps it in a html5lib.inputstream.BufferedStream.

At first glance, this looks innocent enough. However, the code in html5lib/inputstream.py starts with this:

    from __future__ import absolute_import, division, unicode_literals

Notice that it imports unicode_literals, meaning that every string literal in that file is interpreted as a unicode literal (as in Python 3). Now, in the BufferedStream class in that file, the _readFromBuffer method ends by returning the result of joining a list, using a string literal as the separator! Thus, every read from it after the first will result in a chunk of unicode instead of a chunk of bytes (i.e. a plain str in Python 2.6+, where bytes is an alias for str). This obviously fails at line 535, on the `assert isinstance(buffer, bytes)`.
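
A simplified stand-in for that join may help here. It is not html5lib's actual code, and the function names are made up for illustration. Under Python 2 with unicode_literals the first variant silently returns unicode; under Python 3, which this sketch targets, the same mistake raises a TypeError instead. Using a bytes literal as the separator is one plausible fix.

```python
def join_chunks_buggy(chunks):
    # A bare "" is a text literal here (and also under unicode_literals
    # on Python 2), which is the wrong type for joining byte chunks.
    return "".join(chunks)


def join_chunks_fixed(chunks):
    # An explicit bytes literal keeps the result as bytes on both
    # Python 2.6+ and Python 3.
    return b"".join(chunks)


chunks = [b"<html>", b"<head>", b"</head>"]

joined = join_chunks_fixed(chunks)
assert isinstance(joined, bytes)  # the kind of check that failed at line 535

try:
    join_chunks_buggy(chunks)
    buggy_outcome = "returned text"     # what Python 2 did, silently
except TypeError:
    buggy_outcome = "raised TypeError"  # what Python 3 does
```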

I've reported this bug to html5lib-python, in issue 67: https://github.com/html5lib/html5lib-python/issues/67

Meanwhile, unless this is promptly fixed and a new version of html5lib is released, a new version of RDFLib should probably be released with the dependency of html5lib locked to 0.95 for all versions of Python 2 (not only < 2.6). (And kept like that until this bug is fixed, or maybe better until html5lib 1.0 final is released.)

Cheers,
Niklas




Niklas Lindström

Jun 17, 2013, 4:57:48 PM
to rdfli...@googlegroups.com
Oh; an alternative is to have the RDFa parser read in the entire response as a string first, and then pass that to the html5lib parser, just like in Ed's example from the html5lib docs. Slightly suboptimal of course, and using streams is more idiomatic in general and in RDFLib, I think. So we might want to revert such a fix once html5lib is fixed anyway.
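
Roughly, the workaround would look like the sketch below. `parse_buffered` is a hypothetical helper, and the parser is passed in as a plain callable so the example runs without html5lib installed. In real use, `parse` could be the `parse` method of an `html5lib.HTMLParser` instance.

```python
import io


def parse_buffered(stream, parse):
    """Read `stream` to the end, then call `parse` on the resulting bytes.

    `parse` is any callable that accepts a byte string, for example the
    `parse` method of an html5lib parser instance.
    """
    data = stream.read()
    if not isinstance(data, bytes):
        raise TypeError("expected a byte stream, got %r" % type(data))
    return parse(data)


# Stand-in for an HTTP response: any object with a read() method.
response = io.BytesIO(b"<html><body>hello</body></html>")

# Stand-in for the real parser: just report how many bytes arrived.
print(parse_buffered(response, len))
```

The cost is that the whole document sits in memory before parsing starts, which is the "suboptimal" part.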

Cheers,
Niklas

Ivan Herman

Jun 18, 2013, 2:29:34 AM
to rdfli...@googlegroups.com
Niklas,

(this is a common answer to this and the previous mail...)

First of all: thanks, you are great (as usual, I might add:-). It is then clear
that there is a documented bug, as suspected, in the 1.0b1 version of html5lib.

I presume what you mean as a solution is that we should read the full content
into some sort of an internal buffer (well, a StringIO instance) and then give
that to the parser. This is circumventing a (hopefully) temporary bug, creating
a temporary solution that we would have to backpedal later; I think that would
be a bad engineering practice. Have you any feedback on how frequently html5lib
is updated and bugs treated? The numbering of the version suggests that this is
a beta release which should be followed by a 'real' 1.0 release; this should not
take ages.

My proposal would be to leave things as they are, and we should document that,
for the time being, RDFLib depends on version 0.95, and leave it at that. Once
html5lib 1.0 comes out, we can test again and, if the bug is handled, that is
the end of the story. If they do not handle that bug, we can still come back to
do this... hack:-)

Thanks!

Ivan

Gunnar Aastrand Grimnes

Jun 18, 2013, 2:38:41 AM
to rdfli...@googlegroups.com
Good detective work Niklas!

So in conclusion:

The "from __future__ import unicode_literals" only works in 2.6 and above;
we also try to support 2.5, which will need to be pinned at 0.95 of
html5lib "forever".

html5lib releases before 1.0x did not support Python 3, so py3 will have to
stay with 1.0 and hope the bug gets fixed soon.

For 2.6 and 2.7 we should use 0.95 for now, and upgrade once the bug is fixed.

?
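
As a sketch only, the rule above could be expressed like this. `html5lib_requirement` is a hypothetical helper and does not reflect what RDFLib's setup script actually contains.

```python
import sys


def html5lib_requirement(version_info=sys.version_info):
    """Pick an html5lib requirement string for the given interpreter version.

    Hypothetical helper mirroring the plan above; the "==0.95" pin is
    meant to be temporary for 2.6/2.7 and permanent for 2.5.
    """
    if version_info[0] >= 3:
        # html5lib before 1.0 has no Python 3 support, so no pin here.
        return "html5lib"
    # Every Python 2 release stays on the last known-good version.
    return "html5lib==0.95"


print(html5lib_requirement())
```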

Cheers,

- Gunnar



--
http://gromgull.net

Ivan Herman

Jun 18, 2013, 2:50:13 AM
to rdfli...@googlegroups.com


Gunnar Aastrand Grimnes wrote:
> Good detective work Niklas!
>
> So in conclusion:
>
> The "from __future__ import unicode_literals" only works in 2.6 and above,
> we also try to support 2.5, which will need to be pinned at 0.95 of
> html5lib "forever"
>
> html5libs pre 1.0X did not support python 3, so py3 will have to stay
> with 1.0 and hope the bug gets fixed soon.
>
> For 2.6 and 2.7 we should use 0.95 for now, and upgrade once bug is fixed.
>
> ?
>

Agreed!

Ivan

Ed Summers

Jun 18, 2013, 3:58:01 AM
to rdfli...@googlegroups.com
It would be nice to have the RDFa tests enabled as well, so broken
RDFa support doesn't need to be discovered by an rdflib user and
debated on the discussion list in the future.

//Ed

Gunnar Aastrand Grimnes

Jun 18, 2013, 3:59:07 AM
to rdfli...@googlegroups.com
agreed! :)

I will update and re-enable the tests soonishly!

- Gunnar



--
http://gromgull.net

Dan Scott

Feb 26, 2014, 2:32:10 PM
to rdfli...@googlegroups.com
On Tuesday, June 18, 2013 3:59:07 AM UTC-4, Gunnar Aastrand Grimnes wrote:
> agreed! :)
>
> I will update and re-enable the tests soonishly!
>
> - Gunnar
>
> On 18 June 2013 09:58, Ed Summers <e...@pobox.com> wrote:
>> It would be nice to have the RDFa tests enabled as well, so broken
>> RDFa support doesn't need to be discovered by an rdflib user and
>> debated on the discussion list in the future.

Having just gone through the process of trying to update Fedora Linux's python-rdflib packages from 3.2.3 to 4.1.0 and rediscovering this thread, I can:

1. Note that html5lib released an update (1.0b3) that incorporated Niklas's fix back in July 2013; they're now up to release 0.999 (having reverted their versioning scheme because of a pypi requirement)
2. Confirm that running RDFLib 4.1.0 with html5lib 1.0b2 (Fedora's currently packaged version) was horribly broken
3. Confirm that running RDFLib 4.1.0 with html5lib 0.999 fixes RDFa parsing

I have opened a few RDFLib pull requests based on my testing, but the primary one is to remove the html5lib pinning now that 1.0b3 and 0.99+ resolve the problem and hopefully get a new RDFLib bugfix release out real soon now :)
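
For packagers, the version findings in this thread can be condensed into a small lookup. This is only a summary of what was reported here, with illustrative names; any version not mentioned in the thread is treated as unknown rather than guessed at.

```python
# Versions reported in this thread; anything not listed is unknown here.
BROKEN = frozenset(["1.0b1", "1.0b2"])
WORKING = frozenset(["0.95", "1.0b3", "0.999"])


def rdfa_status(html5lib_version):
    """Classify an html5lib version string as reported in this thread."""
    if html5lib_version in BROKEN:
        return "broken"
    if html5lib_version in WORKING:
        return "working"
    return "unknown"


for version in ("0.95", "1.0b2", "0.999"):
    print(version, rdfa_status(version))
```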

With any luck, Fedora will have up-to-date packages in the near future as well.

Ivan Herman

Feb 27, 2014, 2:54:43 AM
to rdfli...@googlegroups.com, rdfli...@googlegroups.com
Thanks for this Scott, I did know about the latest HTML parser being out; will try to test it soon, and I will update the documentation on the main repo...

Thanks again

Ivan

----
Ivan Herman

(Written on my mobile. Excuses for brevity and frequent misspellings...)

