Beautiful Soup: garbled text when parsing gb2312 pages

小皮

Sep 5, 2009, 7:10:09 AM

to python-cn`CPyUG`华蟒用户组(中文Py用户组)

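When I parse gb2312 web pages with BeautifulSoup, some pages still come out garbled even though I specify the encoding as gb2312. My tests show that the returned originalEncoding is windows-1252.

When chardet.detect reports the page encoding as ascii, the result is fine; when it reports GB2312, the content returned by BeautifulSoup's UnicodeDammit is garbled.

There is an article online about a bug in how BeautifulSoup handles gb2312-encoded pages, but the fix it describes does not seem to work.

Can anyone help me?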

Jiahua Huang

Sep 5, 2009, 7:42:14 AM

to pyth...@googlegroups.com

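Please note that what gets labelled gb2312 is not really "gb2312".

Wherever you have gb2312, use gb18030 instead.

Microsoft maps gb2312 and gbk to gb18030, which is convenient for some people and confusing for others.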
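
A minimal sketch of why the codec name matters, assuming Python 3 and its bundled `gb2312`, `gbk` and `gb18030` codecs (the thread itself predates Python 3). The character 镕 is chosen here only as an illustration of a character that GBK can encode but strict GB2312 cannot, and the helper name `can_decode` is hypothetical.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly (strict mode) under the codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Bytes as a server might send them: really GBK, even if labelled gb2312.
raw = "朱镕基".encode("gbk")

# Try the same bytes under each of the three codec names.
results = {enc: can_decode(raw, enc) for enc in ("gb2312", "gbk", "gb18030")}
print(results)
```

Under these assumptions the strict `gb2312` codec rejects the bytes while the two wider codecs accept them, which is the practical reason for always asking for the superset.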

2009/9/5 小皮 <lxb...@gmail.com>

green bai

Sep 6, 2009, 10:39:12 PM

to pyth...@googlegroups.com

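Could you post a link, please?

I ran into something like this once: sometimes the User-Agent setting makes the server return the wrong page content. You could also print out the page you fetched and check that it really is the page you meant to download. In my case, what I had downloaded was the Yahoo home page. I was speechless.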

2009/9/5 小皮 <lxb...@gmail.com>

twinsant

Sep 7, 2009, 2:10:29 AM

to pyth...@googlegroups.com

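Do you mean the post I wrote?

Soup does detect the encoding automatically when it processes a page and tries to convert the content to unicode, but sometimes that fails and the encoding then defaults to 1252... You can read or step through Soup's source code.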

2009/9/5 小皮 <lxb...@gmail.com>

mmx

Sep 7, 2009, 11:25:40 PM

to pyth...@googlegroups.com

Right.
At the very least it should be gbk.

2009/9/5 Jiahua Huang <jhuang...@gmail.com>

小皮

Sep 8, 2009, 7:02:39 AM

to python-cn`CPyUG`华蟒用户组(中文Py用户组)

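BeautifulSoup still does its conversion with unicode: it goes through the likely encodings one at a time with unicode(str, encoding) and moves on to the next one if an exception is raised. I added the likely Chinese encodings to UnicodeDammit, but some pages still come out garbled after fetching, and in those cases BeautifulSoup reports originalEncoding as windows-1252 or ISO-8859-2.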
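
The try-each-encoding behaviour described above can be sketched roughly as follows. This is a simplified illustration in Python 3, not the actual UnicodeDammit code; the function name `decode_with_fallbacks` and the candidate list are assumptions made for the example.

```python
def decode_with_fallbacks(data, candidates=("utf-8", "gb2312", "gbk", "gb18030")):
    """Try each candidate codec in strict mode; fall back to windows-1252."""
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes, try the next one
    # Last resort: a single-byte codec that accepts nearly any byte.
    return data.decode("windows-1252", errors="replace"), "windows-1252"


good = "中文".encode("gbk")
broken = good[:-1] + b"<"  # second character cut in half, then markup

print(decode_with_fallbacks(good))
print(decode_with_fallbacks(broken))
```

A single truncated character is enough to make every strict multi-byte decode fail, so the whole page falls through to the single-byte fallback and every Chinese character turns into accented Latin letters. That matches the windows-1252 result reported above even after the Chinese codecs were added.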

小皮

Sep 8, 2009, 7:06:57 AM

to python-cn`CPyUG`华蟒用户组(中文Py用户组)

Take http://www.chinaz.com/Webmaster/RecSite/040a1b02009.html as an example: when I parse this page with BeautifulSoup, I get garbled text.

Jiahua Huang

Sep 8, 2009, 7:35:34 AM

to pyth...@googlegroups.com

This page of yours is gbk, but it declares itself as gb2312.

2009/9/8 小皮 <lxb...@gmail.com>

Jiahua Huang

Sep 8, 2009, 7:40:38 AM

to pyth...@googlegroups.com

No, this broken page is not even valid gb18030.

$ wget -q -O- http://www.chinaz.com/Webmaster/RecSite/040a1b02009.html | iconv -f gb18030 -t utf8

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<head>

<title>春意盎然踏青去国内旅游web2.0服务推荐_好站推荐_中国站长站 CHINAZ.COM</title>

You cannot decode it the normal way; the only option is to settle on a gb encoding and then have decode ignore the errors.

| decode(...)
| S.decode([encoding[,errors]]) -> string or unicode
|
| Decodes S using the codec registered for encoding. encoding defaults
| to the default encoding. errors may be given to set a different error
| handling scheme. Default is 'strict' meaning that encoding errors raise
| a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
| as well as any other name registerd with codecs.register_error that is
| able to handle UnicodeDecodeErrors.
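
A small sketch of the three error modes mentioned in that help text, assuming Python 3, where the method is `bytes.decode`. The sample bytes are made up for illustration: a GBK string whose last character has lost its second byte, followed by some markup.

```python
# "中文" in GBK with the final byte removed, followed by ASCII markup.
raw = "中文".encode("gbk")[:-1] + b"</p>"

# 'strict' (the default) raises on the dangling lead byte.
try:
    raw.decode("gbk")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops whatever cannot be decoded.
ignored = raw.decode("gbk", errors="ignore")

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = raw.decode("gbk", errors="replace")

print(strict_failed, repr(ignored), repr(replaced))
```

`ignore` gives the cleanest text for display, while `replace` is arguably the safer choice when you want to notice later that the source bytes were damaged.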

2009/9/8 Jiahua Huang <jhuang...@gmail.com>

Jiahua Huang

Sep 8, 2009, 7:43:10 AM

to pyth...@googlegroups.com

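The site probably generates the description automatically with PHP or something similar, using the unsafe string functions that lack the mb_ prefix, so a gb-encoded Chinese character got cut off in the middle and became an invalid byte sequence (being prone to this kind of bad truncation is also a weakness of the gb encodings).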
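
The suspected cause can be reproduced in a few lines. This is a sketch in Python 3 rather than PHP, and the two helper names are hypothetical: one cuts the encoded bytes at a fixed length, as a byte-oriented string function would, and the other only ever cuts between whole characters.

```python
def truncate_bytes(text, limit, encoding="gbk"):
    """Cut the encoded bytes at a fixed byte count (unsafe for multi-byte text)."""
    return text.encode(encoding)[:limit]


def truncate_chars(text, limit, encoding="gbk"):
    """Keep only whole characters that fit within limit bytes."""
    out = b""
    for ch in text:
        piece = ch.encode(encoding)
        if len(out) + len(piece) > limit:
            break
        out += piece
    return out


text = "春意盎然"  # each character takes two bytes in GBK

unsafe = truncate_bytes(text, 5)  # ends halfway through the third character
safe = truncate_chars(text, 5)    # stops after the second character

try:
    unsafe.decode("gbk")
    unsafe_ok = True
except UnicodeDecodeError:
    unsafe_ok = False

print(unsafe_ok, safe.decode("gbk"))
```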

On Tue, Sep 8, 2009 at 7:40 PM, Jiahua Huang <jhuang...@gmail.com> wrote:

No, this broken page is not even valid gb18030.

green bai

Sep 9, 2009, 2:12:42 AM

to pyth...@googlegroups.com

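s = s.decode("gb2312", "ignore")  # s holds the raw page bytes; undecodable bytes are dropped
sp = BeautifulSoup(s)             # pass the resulting unicode string to BeautifulSoup

Just pass unicode in. The page you gave works fine when I test it this way.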

2009/9/8 小皮 <lxb...@gmail.com>

Jiahua Huang

Sep 9, 2009, 2:59:57 AM

to pyth...@googlegroups.com

As I said, wherever you have gb2312, please use gbk or gb18030 instead.
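
Putting the advice in this thread together, one possible approach is to decode the bytes yourself with gb18030 and `ignore`, then hand the resulting string to the parser. The sketch below assumes Python 3 and uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it has no third-party dependency; `TitleParser`, `parse_title` and the sample page bytes are all made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first-level <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title(raw, encoding="gb18030"):
    """Decode leniently first, then parse the already-decoded text."""
    text = raw.decode(encoding, errors="ignore")
    parser = TitleParser()
    parser.feed(text)
    parser.close()
    return parser.title


# A page labelled gb2312 whose description was cut in the middle of a character.
page = (
    "<html><head><title>好站推荐</title>".encode("gbk")
    + b'<meta name="description" content="'
    + "踏青去".encode("gbk")[:-1]
    + b'"></head></html>'
)

print(parse_title(page))
```

Because the parser only ever sees a decoded string, it never has to guess the encoding, so one damaged character cannot push the whole page into a wrong fallback codec.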

2009/9/9 green bai <baigr...@gmail.com>
