(trying to bind all the different threads:-)
Well... this seems to be an html5lib error...
I created a small Python program that tries to do what the RDFa parser does:
[[[
import sys
#sys.path.insert(0,"/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1")
import html5lib
from urllib2 import Request, urlopen
req = Request(url='http://www.bbc.co.uk/news/world-us-canada-22857062')
data = urlopen(req)
parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))
dom = parser.parse(input)
print dom
]]]
If this is run with an older version of html5lib, then things are fine and a
DOM tree is created. If it is run with the latest version of html5lib (on my
local machine: by uncommenting the sys.path line) then I get an exception:
[[[
Traceback (most recent call last):
  File "htmlbug.py", line 12, in <module>
    dom = parser.parse(input)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/html5parser.py", line 223, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/html5parser.py", line 87, in _parse
    parser=self, **kwargs)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/tokenizer.py", line 40, in __init__
    self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/inputstream.py", line 132, in HTMLInputStream
    return HTMLBinaryInputStream(source, encoding, parseMeta, chardet)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/inputstream.py", line 394, in __init__
    self.rawStream = self.openStream(source)
  File "/Users/ivan/Source/PythonModules/html5lib-repo/html5lib-1.0b1/html5lib/inputstream.py", line 431, in openStream
    stream = BytesIO(source)
TypeError: 'builtin_function_or_method' does not have the buffer interface
]]]
I am not sure what that exception means. I presume it should be reported back
to the html5lib developers, but it would help if somebody could run the same
script first, to be sure that there is indeed a bug...
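One possible reading of the last frame, offered as a sketch rather than a
diagnosis: the traceback ends in `BytesIO(source)`, and the message names a
`builtin_function_or_method`, which is what the name `input` refers to in the
script unless it has been reassigned. The snippet below uses only the standard
library (no html5lib, no network) and assumes Python 3, where the wording of
the message differs from the Python 2 text above. `describe_source` is a
hypothetical helper made up for this illustration.

```python
import io


def describe_source(source):
    """Try to wrap `source` the way the last traceback frame does.

    Returns a short string saying whether io.BytesIO accepted the object.
    Hypothetical helper for illustration, not part of html5lib.
    """
    try:
        io.BytesIO(source)
    except TypeError as exc:
        return "rejected %s: %s" % (type(source).__name__, exc)
    return "accepted %s" % type(source).__name__


# The builtin function `input` is not a bytes-like object, so BytesIO refuses it.
print(describe_source(input))

# Raw bytes, such as the result of reading an HTTP response, are accepted.
print(describe_source(b"<html><body>hello</body></html>"))
```

If that reading holds, passing `data` (the object returned by `urlopen`, which
the script never uses) instead of `input` would be the first thing to try
before filing a report.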
Note that if the BBC page is copied to a local file then parsing works properly.
It seems to have something to do with the HTTP return headers, but I do not know
why.
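A rough way to separate the HTTP step from the parsing step is to read both
sources into bytes first and compare them, so that the parser is known to see
the same input in both runs. This is only a sketch, assuming Python 3 and the
standard library; `snapshot` is a hypothetical helper, and the two `BytesIO`
objects stand in for the `urlopen` response and the saved local copy.

```python
import hashlib
import io


def snapshot(stream):
    """Read a file-like object fully and return (payload, SHA-1 hex digest)."""
    payload = stream.read()
    return payload, hashlib.sha1(payload).hexdigest()


# Stand-ins for the two cases. In practice one would be the object returned
# by urlopen(...) and the other a local copy opened in binary mode.
remote_like = io.BytesIO(b"<html><body>same page</body></html>")
local_like = io.BytesIO(b"<html><body>same page</body></html>")

remote_bytes, remote_digest = snapshot(remote_like)
local_bytes, local_digest = snapshot(local_like)

# Equal digests mean the parser would receive identical bytes in both runs,
# which would point away from the page content itself as the cause.
print(remote_digest == local_digest)
```

If the digests match and the failure still appears only in the remote run, the
difference is more likely in how the input object is handed to the parser than
in the page content.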
:-(
Does anybody have a good idea here?
Ivan
Gunnar Aastrand Grimnes wrote:
> Here in the original email Ed had an example uri that failed.
> (I've not tried it)