BeautifulSoup failed to parse the Baidu homepage

dddpppbox

May 2, 2009, 4:10:34 AM

to python-cn`CPyUG`华蟒用户组

I tried both 3.0.7a and 3.1.0.1, and both failed. Frustrating.

Cliff Peng

May 2, 2009, 4:27:30 AM

to pyth...@googlegroups.com

2009/5/2 dddpppbox <dddp...@gmail.com>

> I tried both 3.0.7a and 3.1.0.1, and both failed. Frustrating.

Are you just venting, or do you need help? If you need help, please post the code that fails.

dddpppbox

May 2, 2009, 5:11:20 AM

to python-cn`CPyUG`华蟒用户组

Then please take a look:
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> baidu=urllib2.urlopen("http://www.baidu.com")
>>> soup=BeautifulSoup(baidu.read())

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    soup=BeautifulSoup(baidu.read())
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\HTMLParser.py", line 263, in parse_starttag
    % (rawdata[k:endpos][:20],))
  File "C:\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: junk characters in start tag: u'\u767e\u5ea6\u4e00\u4e0b id=sb>', at line 3, column 201

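On May 2, 4:27 PM, Cliff Peng <szhai...@gmail.com> wrote:
> 2009/5/2 dddpppbox <dddppp...@gmail.com>
>
> > I tried both 3.0.7a and 3.1.0.1, and both failed. Frustrating.
>
> Are you just venting, or do you need help? If you need help, please post the code that fails.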

dddpppbox

May 2, 2009, 5:13:14 AM

to python-cn`CPyUG`华蟒用户组

The Python version is 2.6.2.

Mr Shore

May 2, 2009, 5:15:33 AM

to pyth...@googlegroups.com

Personally, I think that once you have mastered regular expressions, there is no need for this thing to exist.

--
http://maishudi.com
Find jobs, solutions, peers, and friends:
your personal all-in-one service platform that cares about what you care about.

shell909090

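May 2, 2009, 5:26:48 AM

to pyth...@googlegroups.com

Mr Shore wrote:

> Personally, I think that once you have mastered regular expressions, there is no need for this thing to exist.

It still has its uses sometimes.
It actually works even better when hooked up to a download component, so that the client code becomes the callee.
Sometimes a page is complicated and you want it handled in three minutes; a ready-made framework of BS plus a download component is about as fast as development gets.
Of course, don't expect much in terms of generality or performance, especially CPU usage and execution speed.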
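The "client code becomes the callee" idea above could be sketched roughly as follows. This is only an illustration under several assumptions: it targets Python 3 rather than the Python 2.6 used in this thread, it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages, and the names `LinkCollector`, `fetch`, `run`, and `on_page` are hypothetical, not part of any framework mentioned here.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags (a stand-in for a real soup object)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Default downloader. Decoding as UTF-8 is an assumption."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run(urls, on_page, fetch=fetch):
    """Hypothetical driver: it downloads and parses each page,
    then calls the client's on_page(url, links) callback."""
    for url in urls:
        parser = LinkCollector()
        parser.feed(fetch(url))
        on_page(url, parser.links)
```

The client only supplies `on_page`; downloading and parsing stay inside the driver, and passing a different `fetch` makes it easy to try the callback against canned markup.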

Mr Shore

May 2, 2009, 5:50:05 AM

to pyth...@googlegroups.com

You can get it done in three minutes with or without it, surely -_-b

Cliff Peng

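May 2, 2009, 6:55:02 AM

to pyth...@googlegroups.com

The HTML that triggers the error is:

<input type=submit value=百度一下 id=sb>

This is a very common problem. Standard HTML should be: <input type="submit" value="百度一下" id="sb">

I don't know whether Baidu does this to save bytes or whether its engineers were just being lazy.

You should repair the markup with a tool such as tidy first, and then process it with BeautifulSoup.

However, as I mentioned in an earlier post that nobody paid attention to, the several existing Python wrappers for tidy are all awkward to use.

Guru Zhang recommended a tidy wrapper he wrote himself; it is worth a try.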
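To make the difference between the two forms of the tag concrete, here is a small sketch. It assumes Python 3, whose standard-library `html.parser` is lenient about unquoted attribute values, unlike the Python 2.6 `HTMLParser` in the traceback above. The `StartTagCollector` class and `start_tags` helper are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Records every start tag together with its attribute list."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def start_tags(markup):
    """Parse markup and return the list of (tag, attrs) pairs seen."""
    parser = StartTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


# The tag as served by Baidu, and the quoted form suggested above.
unquoted = start_tags('<input type=submit value=百度一下 id=sb>')
quoted = start_tags('<input type="submit" value="百度一下" id="sb">')
print(unquoted == quoted)
```

Under that assumption both spellings yield the same attribute list, which suggests the failure here comes from the strictness of the parser rather than from any ambiguity in the markup.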

dddpppbox

May 2, 2009, 8:01:11 AM

to python-cn`CPyUG`华蟒用户组

Thanks, I get it now.
If the proportion of non-standard pages like this is too high, BeautifulSoup becomes much less practical.

I have just started switching to sgmllib~

Cliff Peng

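May 2, 2009, 8:40:53 AM

to pyth...@googlegroups.com

2009/5/2 dddpppbox <dddp...@gmail.com>

> Thanks, I get it now.
> If the proportion of non-standard pages like this is too high, BeautifulSoup becomes much less practical.
>
> I have just started switching to sgmllib~

There is no need to go back to sgmllib. Isn't that throwing the baby out with the bathwater?

tidy repairs HTML very efficiently; the only problem is that there is no really suitable Python wrapper at the moment (I haven't managed to download Guru Zhang's yet).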

张沈鹏

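May 2, 2009, 11:12:08 AM

to pyth...@googlegroups.com

However, tidy is unstable: repeated use can crash the process.

Over long runs the process crashes, and I haven't found the cause.

You could simply use the original version and call it out of process instead.

wget http://nchc.dl.sourceforge.net/sourceforge/tidy/tidy4aug00.tgz

Install it, and then: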

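from __future__ import with_statement  # only needed on Python 2.5
import subprocess
import os

def tidy(html):
    # Write the markup to an anonymous temp file that becomes tidy's stdin.
    with os.tmpfile() as temp:
        # Discard tidy's warnings by pointing stderr at the null device.
        with open(os.devnull, "w") as null:
            print >>temp, html
            temp.seek(0)
            html = subprocess.Popen(
                ["tidy", "-utf8", "-asxhtml"],
                stdin=temp,
                stderr=null,
                stdout=subprocess.PIPE
            ).communicate()[0]
    # Keep only what tidy placed between <body> and </body>.
    begin = "<body>"
    return html[html.find(begin)+len(begin):html.rfind("</body>")].strip()

tidy("<div>x<a>a")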
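For readers on Python 3, where `os.tmpfile()` and `print >>` no longer exist, the same out-of-process idea might look like the sketch below. It is an adaptation, not the code posted in this thread: it assumes a `tidy` executable on the PATH that accepts the same `-utf8` and `-asxhtml` options, and the `extract_body` helper and the `executable` parameter are hypothetical additions for illustration.

```python
import subprocess


def extract_body(xhtml):
    """Return the text between the first <body> and the last </body>.

    Falls back to the whole (stripped) input if either marker is missing,
    for example when the body tag carries attributes.
    """
    begin, end = "<body>", "</body>"
    start = xhtml.find(begin)
    stop = xhtml.rfind(end)
    if start == -1 or stop == -1:
        return xhtml.strip()
    return xhtml[start + len(begin):stop].strip()


def tidy(html, executable="tidy"):
    """Pipe html through an external tidy process and return the repaired body."""
    result = subprocess.run(
        [executable, "-utf8", "-asxhtml"],
        input=html.encode("utf-8"),   # fed to tidy's stdin, no temp file needed
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,    # discard warnings, as in the post above
    )
    # The exit status is deliberately not checked in this sketch.
    return extract_body(result.stdout.decode("utf-8", errors="replace"))
```

Because tidy runs in a separate process, a crash in tidy would not take the Python interpreter down with it, which is the point of calling it out of process.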

Cliff Peng

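May 2, 2009, 11:24:02 AM

to pyth...@googlegroups.com

2009/5/2 张沈鹏 <zsp...@gmail.com>

> However, tidy is unstable: repeated use can crash the process.

By "unstable", do you mean tidylib?

Very impressive. The guru's skills really are superb; I am full of admiration.

I only half understand it, but I am going to test it.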

Cliff Peng

May 3, 2009, 11:31:39 AM

to pyth...@googlegroups.com

Guru Zhang's wrapper really does work well; it passed my tests in a win32 environment.

dddpppbox

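May 3, 2009, 8:00:51 PM

to python-cn`CPyUG`华蟒用户组(中文Py用户组)

Learning from this.

On May 2, 11:12 PM, 张沈鹏 <zsp...@gmail.com> wrote:
> However, tidy is unstable: repeated use can crash the process.
>
> Over long runs the process crashes, and I haven't found the cause.
>
> You could simply use the original version and call it out of process instead.
>
> wget http://nchc.dl.sourceforge.net/sourceforge/tidy/tidy4aug00.tgz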

feng

May 6, 2009, 5:56:06 AM

to python-cn`CPyUG`华蟒用户组(中文Py用户组)

I wrote something myself that is not bothered by non-standard pages like these; it is fairly similar to sgmllib. http://code.google.com/p/tagparser/
