BeautifulSoup 4.5.0 'LXMLTreeBuilder' object has no attribute 'processing_instruction_class'

liu...@1for.one

Jul 26, 2016, 8:48:18 PM
to beautifulsoup
This happens when I try to convert a string to a soup object. It works fine with 4.4.1.

leonardr

Jul 26, 2016, 9:46:48 PM
to beautifulsoup

> This happens when I try to convert String to Soup Object. Works fine when it's 4.4.1.

Can you share the markup and the code you use? This case is covered in the test suite, and I can only duplicate the error by removing code.

Also please let me know which version of Python you're using.

If possible, run the unit test suite and see if it passes on your computer. But what I need most is your markup and code.

Leonard
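
The details requested above (Python version, library versions) can be collected in one go. The following is only a minimal sketch and was not posted in this thread; the helper name `collect_versions` is hypothetical, and it assumes each package exposes a `__version__` attribute, reporting "unknown" when it does not and "not installed" when the import fails.

```python
import importlib
import platform


def collect_versions(packages=("bs4", "lxml")):
    """Return a dict of version details worth pasting into a bug report.

    Hypothetical helper: it only reads each package's __version__
    attribute and never parses any markup.
    """
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            info[name] = "not installed"
        else:
            # Fall back to "unknown" if the package exposes no __version__.
            info[name] = getattr(module, "__version__", "unknown")
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print("{}: {}".format(key, value))
```

The markup and the calling code still have to be supplied by hand; this only covers the version questions.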

Faaken Naame

Jul 29, 2016, 6:20:53 AM
to beautifulsoup
Ok, I have a minimum working example.


Crashing script:
import bs4


page = '''
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head>
</head>
<body>
</body>
</html>

'''

bs4.BeautifulSoup(page, "lxml")


Yes, this is actually HTML; I have no idea why it has the <?xml version="1.0"?> declaration in it.

Platform details:

durr@rwpscrape:/media/Storage/Scripts/ReadableWebProxy python3
Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform
>>> platform.platform()
'Linux-4.4.0-24-generic-x86_64-with-Ubuntu-16.04-xenial'
>>> platform.python_version()
'3.5.1+'
>>> platform.uname()
uname_result(system='Linux', node='rwpscrape', release='4.4.0-24-generic', version='#43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016', machine='x86_64', processor='x86_64')
>>> platform.linux_distribution()
('Ubuntu', '16.04', 'xenial')
>>> import bs4
>>> bs4.__version__
'4.5.0'
>>> from lxml import etree
>>> etree.LXML_VERSION
(3, 6, 1, 0)

On the same platform, after fetching the source tarball from https://www.crummy.com/software/BeautifulSoup/bs4/download/4.5/beautifulsoup4-4.5.0.tar.gz and running 2to3:
durr@rwpscrape:~/beautifulsoup4-4.5.0 python3 -m unittest discover -s bs4
......................................................................................................................./home/durr/beautifulsoup4-4.5.0/bs4/builder/_lxml.py:247: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  self.parser.feed(markup)
....../home/durr/beautifulsoup4-4.5.0/bs4/builder/_lxml.py:247: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  self.parser.feed(markup)
............................................/home/durr/beautifulsoup4-4.5.0/bs4/builder/_lxml.py:130: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  self.parser.feed(data)
...................................................................................................................................................................................................................................................../home/durr/beautifulsoup4-4.5.0/bs4/builder/_lxml.py:130: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  self.parser.feed(data)
..................../home/durr/beautifulsoup4-4.5.0/bs4/builder/_lxml.py:247: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  self.parser.feed(markup)
..................
----------------------------------------------------------------------
Ran 452 tests in 0.657s


OK

MB

Jul 29, 2016, 1:15:56 PM
to beautifulsoup
import bs4


page = '''
<?xml version="1.0" encoding="utf-8"?>
<Index>
  <version>2016</version>
  <title>Fake data</title>
  <letter>
    <title>A</title>
    <mainTerm>
      <title>Aardvark</title>
      <code>A11</code>
    </mainTerm>
    </letter>
</Index>

'''

bs4.BeautifulSoup(page)

Traceback (most recent call last):
  File "C:/projects/bs4test.py", line 20, in <module>
    bs4.BeautifulSoup(page)
  File "C:\Python35\lib\site-packages\bs4\__init__.py", line 228, in __init__
    self._feed()
  File "C:\Python35\lib\site-packages\bs4\__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "C:\Python35\lib\site-packages\bs4\builder\_lxml.py", line 247, in feed
    self.parser.feed(markup)
  File "src\lxml\parser.pxi", line 1205, in lxml.etree._FeedParser.feed (src\lxml\lxml.etree.c:110709)
  File "src\lxml\parser.pxi", line 1327, in lxml.etree._FeedParser.feed (src\lxml\lxml.etree.c:110583)
  File "src\lxml\parsertarget.pxi", line 141, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:126930)
  File "src\lxml\parsertarget.pxi", line 135, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:126799)
  File "src\lxml\lxml.etree.pyx", line 324, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:10789)
  File "src\lxml\saxparser.pxi", line 549, in lxml.etree._handleSaxPI (src\lxml\lxml.etree.c:121866)
  File "src\lxml\parsertarget.pxi", line 94, in lxml.etree._PythonSaxParserTarget._handleSaxPi (src\lxml\lxml.etree.c:126297)
  File "C:\Python35\lib\site-packages\bs4\builder\_lxml.py", line 211, in pi
    self.soup.endData(self.processing_instruction_class)
AttributeError: 'LXMLTreeBuilder' object has no attribute 'processing_instruction_class' 
 


C:\Users\M>python35
Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:16:59) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>> bs4.__version__
'4.5.0'
>>> from lxml import etree
>>> etree.LXML_VERSION
(3, 6, 1, 0)
>>> exit()
 


C:\Users\M>python35 -m unittest discover -s bs4
.............................................................C:\Python35\lib\site-packages\bs4\builder\_lxml.py:247: Dep
recationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  self.parser.feed(markup)
......C:\Python35\lib\site-packages\bs4\builder\_lxml.py:247: DeprecationWarning: inspect.getargspec() is deprecated, us
e inspect.signature() instead
  self.parser.feed(markup)
............................................C:\Python35\lib\site-packages\bs4\builder\_lxml.py:130: DeprecationWarning:
inspect.getargspec() is deprecated, use inspect.signature() instead
  self.parser.feed(data)
.................C:\Python35\lib\genericpath.py:19: DeprecationWarning: The Windows bytes API has been deprecated, use U
nicode filenames instead
  os.stat(path)
........................................................................................................................
............................................................................................................C:\Python35\
lib\site-packages\bs4\builder\_lxml.py:130: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signatur
e() instead
  self.parser.feed(data)
....................C:\Python35\lib\site-packages\bs4\builder\_lxml.py:247: DeprecationWarning: inspect.getargspec() is
deprecated, use inspect.signature() instead
  self.parser.feed(markup)
..................
----------------------------------------------------------------------
Ran 394 tests in 0.365s

OK
 

Faaken Naame

Jul 30, 2016, 3:00:46 AM
to beautifulsoup
My spider ran a while longer, and I now have several more inputs that reproduce the problem. I've worked around the issue described above (the apparently spurious XML tag) by just removing it from the input with string munging.

<html>
<body>
<I>Wut<?I>
</body>
</html>


<html>
<body>
<B>
<CENTER>
Act Two<br  />
<br  />
~~~~~~~~~~~~~~~ 
<?CENTER>
</B>
</body>
</html>

<?I>

A case of incorrect PHP:
<html lang="en-US">
<body>
<?php
?>
</body>
</html>



In any event, the common component across all these "bad" input pages is incorrect HTML tags starting with "<?". In many of the pages where I ran into this issue, it appears to be due to freeform user input of HTML, together with a slightly overactive shift-key finger. The other common cause, aside from the XML declaration, is broken PHP that's accidentally making it into the output.
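
The string-munging workaround mentioned above can be sketched roughly as follows. This is an assumption about what such a cleanup might look like, not the code the spider actually uses and not part of Beautiful Soup; the helper `strip_pi_like_tags` is hypothetical. It deletes everything from "<?" up to the next ">", so it also removes legitimate processing instructions and will cut short any PHP block that itself contains a ">".

```python
import re

# Matches "<?" followed by anything up to and including the next ">".
# The negated class [^>] also matches newlines, so multi-line runs
# such as "<?php\n?>" are covered.
_PI_LIKE = re.compile(r"<\?[^>]*>")


def strip_pi_like_tags(markup):
    """Remove every '<?...>' run from markup before it is parsed.

    Hypothetical helper for illustration only.
    """
    return _PI_LIKE.sub("", markup)


samples = [
    '<?xml version="1.0"?>\n<html></html>',
    "<html><body><I>Wut<?I></body></html>",
    "<html><body><?php\n?></body></html>",
]
for sample in samples:
    print(repr(strip_pi_like_tags(sample)))
```

The cleaned string can then be handed to bs4.BeautifulSoup(cleaned, "lxml") in the same way as in the crashing script.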

leonardr

Jul 30, 2016, 7:59:25 AM
to beautifulsoup
Thanks for helping me duplicate the bug. I've filed https://bugs.launchpad.net/beautifulsoup/+bug/1608048 to track the work and have fixed the bug. The fix will be in the next release and a patch is attached.

Leonard

1608048.diff

Faaken Naame

Jul 30, 2016, 7:04:06 PM
to beautifulsoup
Sweet! Any idea when this'll make it out to PyPI?

MB

Aug 2, 2016, 12:00:31 PM
to beautifulsoup
Thanks for the patch.

liu...@jobsonic.com

Aug 2, 2016, 7:29:43 PM
to beautifulsoup
Mine is Python 2.7.11, so I don't think this is only a Python 3 issue.

